Abstract

We consider the dimensionality-reduction problem (finding a subspace approximation of observed data) for 
contaminated data in the high dimensional regime, where the number of observations is of the same magnitude as 
the number of variables of each observation, and the data set contains some (arbitrarily) corrupted observations. We 
propose a High-dimensional Robust Principal Component Analysis (HR-PCA) algorithm that is tractable, robust 
to contaminated points, and easily kernelizable. The resulting subspace has a bounded deviation from the desired 
one, achieves maximal robustness (a breakdown point of 50%, whereas all existing algorithms have a breakdown
point of zero), and, unlike ordinary PCA algorithms, achieves optimality in the limit case where the proportion of
corrupted points goes to zero.

Index Terms 

Statistical Learning, Dimension Reduction, Principal Component Analysis, Robustness, Outlier 

I. Introduction 

The analysis of very high dimensional data - data sets where the dimensionality of each observation is 
comparable to or even larger than the number of observations - has drawn increasing attention in the last 
few decades [1], [2]. For example, observations on individual instances can be curves, spectra, images or 
even movies, where a single observation has dimensionality ranging from thousands to billions. Practical 
high dimensional data examples include DNA Microarray data, financial data, climate data, web search
engine data, and consumer data. In addition, the now-standard "Kernel Trick" [3], a pre-processing routine
which non-linearly maps the observations into a (possibly infinite dimensional) Hilbert space, transforms 
virtually every data set to a high dimensional one. Efforts of extending traditional statistical tools (designed 
for the low dimensional case) into this high-dimensional regime are generally unsuccessful. This fact has 
stimulated research on formulating fresh data-analysis techniques able to cope with such a "dimensionality 
explosion." 

Principal Component Analysis (PCA) is perhaps one of the most widely used statistical techniques 
for dimensionality reduction. Work on PCA dates back as early as [4], and has become one of the 
most important techniques for data compression and feature extraction. It is widely used in statistical 
data analysis, communication theory, pattern recognition, and image processing [5]. The standard PCA 
algorithm constructs the optimal (in a least-square sense) subspace approximation to observations by 
computing the eigenvectors or Principal Components (PCs) of the sample covariance or correlation matrix. 
Its broad application can be attributed to primarily two features: its success in the classical regime for 
recovering a low-dimensional subspace even in the presence of noise, and also the existence of efficient 
algorithms for computation. Indeed, PCA is nominally a non-convex problem, which we can, nevertheless, 
solve, thanks to the magic of the SVD which allows us to maximize a convex function. It is well-known, 
however, that precisely because of the quadratic error criterion, standard PCA is exceptionally fragile, and 
the quality of its output can suffer dramatically in the face of only a few (even a vanishingly small fraction of)
grossly corrupted points. Such non-probabilistic errors may be present due to data corruption stemming
from sensor failures, malicious tampering, or other reasons. Attempts to use other error functions that grow
more slowly than the quadratic, and that might therefore be more robust to outliers, result in non-convex
(and intractable) optimization problems.
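
To make this fragility concrete, the following minimal sketch plants a single grossly corrupted point and compares the leading principal component before and after. The sizes, the seed, the signal strength, and the helper name `top_pc` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def top_pc(Y):
    """Leading eigenvector of the (uncentered) sample covariance of the rows of Y."""
    cov = Y.T @ Y / Y.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvecs[:, -1]

rng = np.random.default_rng(0)
n, m = 200, 50
true_dir = np.zeros(m)
true_dir[0] = 1.0

# Authentic data: a strong signal along e_0 plus unit Gaussian noise.
Y = 5.0 * rng.standard_normal((n, 1)) * true_dir + rng.standard_normal((n, m))
clean_alignment = abs(top_pc(Y) @ true_dir)

# One grossly corrupted point along the orthogonal direction e_1.
outlier = np.zeros(m)
outlier[1] = 1e4
Y_bad = np.vstack([Y, outlier])
corrupted_alignment = abs(top_pc(Y_bad) @ true_dir)

print(f"alignment with true direction, clean data:  {clean_alignment:.3f}")
print(f"alignment with true direction, one outlier: {corrupted_alignment:.3f}")
```

In this toy setting the quadratic criterion lets one point of large magnitude dominate the sample covariance, so the leading component swings toward the outlier.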

In this paper, we consider a high-dimensional counterpart of Principal Component Analysis (PCA) that 
is robust to the existence of arbitrarily corrupted or contaminated data. We start with the standard statistical 
setup: a low dimensional signal is (linearly) mapped to a very high dimensional space, after which point 
high-dimensional Gaussian noise is added, to produce points that no longer lie on a low dimensional 
subspace. At this point, we deviate from the standard setting in two important ways: (1) a constant 
fraction of the points are arbitrarily corrupted in a perhaps non-probabilistic manner. We emphasize that 
these "outliers" can be entirely arbitrary, rather than from the tails of any particular distribution, e.g., the 
noise distribution; we call the remaining points "authentic"; (2) the number of data points is of the same 
order as (or perhaps considerably smaller than) the dimensionality. As we discuss below, these two points 
confound (to the best of our knowledge) all tractable existing Robust PCA algorithms. 

A fundamental feature of the high dimensionality is that the noise is large in some direction, with very 
high probability, and therefore definitions of "outliers" from classical statistics are of limited use in this 
setting. Another important property of this setup is that the signal-to-noise ratio (SNR) can go to zero, as
the $\ell_2$ norm of the high-dimensional Gaussian noise scales as the square root of the dimensionality. In the
standard (i.e., low-dimensional) case, a low SNR generally implies that the signal cannot be recovered,
even without any corrupted points.

The Main Result 

In this paper, we give a surprisingly optimistic message: contrary to what one might expect given the 
brittle nature of classical PCA, and in stark contrast to previous algorithms, it is possible to recover such 
low SNR signals, in the high-dimensional regime, even in the face of a constant fraction of arbitrarily 
corrupted data. Moreover, we show that this can be accomplished with an efficient (polynomial time) 
algorithm, which we call High-Dimensional Robust PCA (HR-PCA). The algorithm we propose here 
is tractable, provably robust to corrupted points, and asymptotically optimal, recovering the exact
low-dimensional subspace when the number of corrupted points scales more slowly than the number of
"authentic" samples (i.e., when the fraction of corrupted points tends to zero). To the best of our knowledge, 
this is the only algorithm of this kind. Moreover, it is easily kernelizable. 

The proposed algorithm performs a PCA and a random removal alternately. Therefore, in each iteration 
a candidate subspace is found. The random removal process guarantees that with high probability, one of
the candidate solutions found by the algorithm is "close" to the optimal one. Thus, comparing all solutions
using a (computationally efficient) one-dimensional robust variance estimator leads to a "sufficiently good"
output. We will make this argument rigorous in the following sections. 

Organization and Notation 

The paper is organized as follows: In Section II we discuss past work and the reasons that classical
robust PCA algorithms fail to extend to the high dimensional regime. In Section III we present the setup
of the problem, and the HR-PCA algorithm. We also provide finite sample and asymptotic performance
guarantees. Section IV is devoted to the kernelization of HR-PCA. The performance guarantees are proved
in Section V. We provide some numerical experiment results in Section VI. Some technical details in the
derivation of the performance guarantees are postponed to the appendix.

Capital letters and boldface letters are used to denote matrices and vectors, respectively. A $k \times k$ unit
matrix is denoted by $I_k$. For $c \in \mathbb{R}$, $[c]^+ = \max(0, c)$. We let
$\mathcal{B}_d = \{\mathbf{w} \in \mathbb{R}^d \mid \|\mathbf{w}\| \le 1\}$, and $\mathcal{S}_d$ be
its boundary. We use a subscript $(\cdot)$ to represent order statistics of a random variable. For example, let
$v_1, \dots, v_n \in \mathbb{R}$. Then $v_{(1)}, \dots, v_{(n)}$ is a permutation of $v_1, \dots, v_n$ in non-decreasing order.

II. Relation to Past Work

In this section, we discuss past work and the reasons that classical robust PCA algorithms fail to extend 
to the high dimensional regime. 

Much previous robust PCA work focuses on the traditional robustness measurement known as the 
"breakdown point" [6], i.e., the percentage of corrupted points that can make the output of the algorithm 
arbitrarily bad. To the best of our knowledge, no other algorithm can handle any constant fraction of 
outliers with a lower bound on the error in the high-dimensional regime. That is, the best-known breakdown 
point for this problem is zero. We show that the algorithm we provide has a breakdown point of 50%, which
is the best possible for any algorithm. In addition to this, we focus on providing explicit lower bounds 
on the performance, for all corruption levels up to the breakdown point. 

In the low-dimensional regime where the observations significantly outnumber the variables of each 
observation, several robust PCA algorithms have been proposed (e.g., [7]-[14]). These algorithms can be 
roughly divided into two classes: (i) performing a standard PCA on a robust estimation of the covariance 
or correlation matrix; (ii) maximizing (over all unit-norm w) some r(w) that is a robust estimate of the 
variance of univariate data obtained by projecting the observations onto direction w. Both approaches 
encounter serious difficulties when applied to high-dimensional data-sets: 

• There are not enough observations to robustly estimate the covariance or correlation matrix. For
example, the widely-used MVE estimator [15], which treats the Minimum Volume Ellipsoid that 
covers half of the observations as the covariance estimation, is ill-posed in the high-dimensional case. 
Indeed, to the best of our knowledge, the assumption that observations far outnumber dimensionality 
seems crucial for those robust variance estimators to achieve statistical consistency. 

• Algorithms that subsample the points, and in the spirit of leave-one-out approaches, attempt in this 
way to compute the correct principal components, also run into trouble. The constant fraction of 
corrupted points means the sampling rate must be very low (in particular, leave-one-out accomplishes 
nothing). But then, due to the high dimensionality of the problem, principal components from one
subsample to the next can vary greatly.

• Unlike standard PCA that has a polynomial computation time, the maximization of r(w) is generally 
a non-convex problem, and becomes extremely hard to solve or approximate as the dimensionality 
of w increases. In fact, the number of the local maxima grows so fast that it is effectively impossible 
to find a sufficiently good solution using gradient-based algorithms with random re-initialization. 

We now discuss in greater detail three pitfalls some existing algorithms face in high dimensions. 

Diminishing Breakdown Point: The breakdown point measures the fraction of outliers required to 
change the output of a statistics algorithm arbitrarily. If an algorithm's breakdown point has an inverse 
dependence on the dimensionality, then it is unsuitable in our regime. Many algorithms fall into this 
category. In [16], several covariance estimators including M-estimator [17], Convex Peeling [18], [19], 
Ellipsoidal Peeling [20], [21], Classical Outlier Rejection [22], [23], Iterative Deletion [24] and Iterative 
Trimming [25], [26] are all shown to have breakdown points upper-bounded by the inverse of the 
dimensionality, hence not useful in the regime of interest. 

Noise Explosion: As we define in greater detail below, the model we consider is the standard PCA
setup: we observe samples $\mathbf{y} = A\mathbf{x} + \mathbf{n}$, where $A$ is an $m \times d$ matrix,
$\mathbf{n} \sim \mathcal{N}(0, I_m)$, and $n \approx m \gg d$.
Thus, $n$ is the number of samples, $m$ the dimension, and $d$ the dimension of $\mathbf{x}$ and thus the number
of principal components. Let $\sigma_1$ denote the largest singular value of $A$. Then
$\mathbb{E}(\|\mathbf{n}\|_2) \approx \sqrt{m}$ (in fact, the magnitude sharply concentrates around $\sqrt{m}$), while
$\mathbb{E}(\|A\mathbf{x}\|_2) \le \sqrt{\mathrm{trace}(A^\top A)} \le \sqrt{d}\,\sigma_1$. Unless
$\sigma_1$ grows very quickly (namely, at least as fast as $\sqrt{m}$), the magnitude of the noise quickly becomes
the dominating component of each authentic point we obtain. Because of this, several perhaps counterintuitive
properties hold in this regime. First, any given authentic point is with overwhelming probability
very close to orthogonal to the signal space (i.e., to the true principal components). Second, it is possible
for a constant fraction of corrupted points all with a small Mahalanobis distance to significantly change
the output of PCA. Indeed, by aligning $\lambda n$ points of magnitude some constant multiple of $\sigma_1$, it is easy
to see that the output of PCA can be strongly manipulated; on the other hand, since the noise magnitude
is $\sqrt{m} \approx \sqrt{n}$ in a direction perpendicular to the principal components, the Mahalanobis distance of
each corrupted point will be very small. Third, and similarly, it is possible for a constant fraction of
corrupted points all with small Stahel-Donoho outlyingness to significantly change the output of PCA.
Stahel-Donoho outlyingness is defined as:

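$$u_i \triangleq \sup_{\|\mathbf{w}\|=1} \frac{\left|\mathbf{w}^\top \mathbf{y}_i - \mathrm{med}_j(\mathbf{w}^\top \mathbf{y}_j)\right|}{\mathrm{med}_k \left|\mathbf{w}^\top \mathbf{y}_k - \mathrm{med}_j(\mathbf{w}^\top \mathbf{y}_j)\right|}.$$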

To see that this can be small, consider the same setup as for the Mahalanobis example: small magnitude
outliers, all aligned along one direction. Then the Stahel-Donoho outlyingness of such a corrupted point is
$O(\sigma_1/\lambda)$. For a given authentic sample $\mathbf{y}_i$, take $\mathbf{v} = \mathbf{y}_i/\|\mathbf{y}_i\|$. On the projection onto $\mathbf{v}$, all samples except
$\mathbf{y}_i$ follow a Gaussian distribution with a variance roughly 1, because $\mathbf{v}$ only depends on $\mathbf{y}_i$ (recall that $\mathbf{v}$
is nearly orthogonal to $A$). Hence the S-D outlyingness of an authentic sample is $\Theta(\sqrt{m})$, which is much larger
than that of a corrupted point.
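
The two scaling claims above (noise norm concentrating near $\sqrt{m}$, and authentic points being nearly orthogonal to the signal space) can be checked numerically. The sketch below is only an illustration: the dimensions, the seed, and the choice of an $A$ with equal singular values `sigma` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, n = 2000, 3, 500
sigma = 2.0                                   # assumed common singular value of A

# Orthonormal basis Q of the signal space, and A = sigma * Q (m x d).
Q, _ = np.linalg.qr(rng.standard_normal((m, d)))
A = sigma * Q

X = rng.standard_normal((n, d))               # signals x_i ~ N(0, I_d)
N = rng.standard_normal((n, m))               # noise n_i ~ N(0, I_m)
Z = X @ A.T + N                               # authentic points z_i = A x_i + n_i

noise_norms = np.linalg.norm(N, axis=1)
signal_norms = np.linalg.norm(X @ A.T, axis=1)

# Cosine of the angle between each authentic point and the signal space span(A).
cosines = np.linalg.norm(Z @ Q, axis=1) / np.linalg.norm(Z, axis=1)

print("mean ||n|| / sqrt(m):      ", noise_norms.mean() / np.sqrt(m))
print("mean ||Ax||:               ", signal_norms.mean())
print("sqrt(d) * sigma_1:         ", np.sqrt(d) * sigma)
print("median cosine to span(A):  ", np.median(cosines))
```

With these assumed sizes the noise norm is roughly twelve times the signal norm, so each authentic point makes a large angle with the true principal subspace.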

The Mahalanobis distance and the S-D outlyingness are extensively used in existing robust PCA
algorithms. For example, Classical Outlier Rejection, Iterative Deletion and various alternatives of Iterative
Trimming all use the Mahalanobis distance to identify possible outliers. Depth Trimming [16] weights
the contribution of observations based on their S-D outlyingness. More recently, the ROBPCA algorithm
proposed in [27] selects a subset of observations with least S-D outlyingness to compute the $d$-dimensional
signal space. Thus, in the high-dimensional case, these algorithms may run into problems since neither
Mahalanobis distance nor S-D outlyingness is a valid indicator of outliers. Indeed, as shown in the
simulations, the empirical performance of such algorithms can be worse than standard PCA, because
they remove the authentic samples.

Algorithmic Tractability: There are algorithms that do not rely on Mahalanobis distance or S-D outlyingness,
and have a non-diminishing breakdown point, namely Minimum Volume Ellipsoid (MVE),
Minimum Covariance Determinant (MCD) [28] and Projection-Pursuit [29]. MVE finds the minimum 
volume ellipsoid that covers a certain fraction of observations. MCD finds a fraction of observations whose 
covariance matrix has a minimal determinant. Projection Pursuit maximizes a certain robust univariate 
variance estimator over all directions. 

MCD and MVE are combinatorial, and hence (as far as we know) computationally intractable as the 
size of the problem scales. More difficult yet, MCD and MVE are ill-posed in the high-dimensional setting 
where the number of points (roughly) equals the dimension, since there exist infinitely many zero-volume 
(determinant) ellipsoids satisfying the covering requirement. Nevertheless, we note that such algorithms 
work well in the low-dimensional case, and hence can potentially be used as a post-processing procedure 
of our algorithm by projecting all observations to the output subspace to fine tune the eigenvalues and 
eigenvectors we produce. 

Maximizing a robust univariate variance estimator as in Projection Pursuit, is also non-convex, and thus 
to the best of our knowledge, computationally intractable. In [30], the authors propose a fast Projection-Pursuit
algorithm, avoiding the non-convex optimization problem of finding the optimal direction, by
only examining the directions of each sample. While this is suitable in the classical regime, in the high-dimensional
setting this algorithm fails, since as discussed above, the direction of each sample is almost
orthogonal to the direction of true principal components. Such an approach would therefore only be 
examining candidate directions nearly orthogonal to the true maximizing directions. 

Low Rank Techniques: Finally, we discuss the recent paper [31]. In this work, the authors adapt techniques
from low-rank matrix approximation, and in particular, results similar to the matrix decomposition
results of [32], in order to recover a low-rank matrix $L_0$ from highly corrupted measurements $M = L_0 + S_0$,
where the noise term, $S_0$, is assumed to have a sparse structure. This models the scenario where we have
perfect measurement of most of the entries of $L_0$, and a small (but constant) fraction of the random entries
are arbitrarily corrupted. This work is much closer in spirit, in motivation, and in terms of techniques, to
the low-rank matrix completion and matrix recovery problems in [33]-[35] than the setting we consider
and the work presented herein. In particular, in our setting, each corrupted point can change every element
of a column of $M$, and hence render the low rank approach invalid.

III. HR-PCA: The Algorithm 

The algorithm of HR-PCA is presented in this section. We start with the mathematical setup of the
problem in Section III-A. The HR-PCA algorithm as well as its performance guarantee are then given in
Section III-B.

A. Problem Setup

We now define in detail the problem described above. 

• The "authentic samples" $\mathbf{z}_1, \dots, \mathbf{z}_t \in \mathbb{R}^m$ are generated by
$\mathbf{z}_i = A\mathbf{x}_i + \mathbf{n}_i$, where $\mathbf{x}_i \in \mathbb{R}^d$ (the
"signal") are i.i.d. samples of a random variable $\mathbf{x}$, and $\mathbf{n}_i$ (the "noise") are independent realizations
of $\mathbf{n} \sim \mathcal{N}(0, I_m)$. The matrix $A \in \mathbb{R}^{m \times d}$ and the distribution of $\mathbf{x}$ (denoted by $\mu$) are unknown.
We do assume, however, that the distribution $\mu$ is absolutely continuous with respect to the Borel
measure, that it is spherically symmetric (and in particular, $\mathbf{x}$ has mean zero and variance $I_d$), and that it has
light tails; specifically, there exist constants $K, C > 0$ such that $\Pr(\|\mathbf{x}\| > x) \le K\exp(-Cx)$ for
all $x \ge 0$. Since the distribution $\mu$ and the dimension $d$ are both fixed as $m, n$ scale, the assumption
that $\mu$ is spherically symmetric can be easily relaxed, at the expense of potentially significant
notational complexity.

• The outliers (the corrupted data) are denoted $\mathbf{o}_1, \dots, \mathbf{o}_{n-t} \in \mathbb{R}^m$ and, as emphasized above, they
are arbitrary (perhaps even maliciously chosen). We denote the fraction of corrupted points by
$\lambda = (n - t)/n$.

• We only observe the contaminated data set

$$\mathcal{Y} = \{\mathbf{y}_1, \dots, \mathbf{y}_n\} = \{\mathbf{z}_1, \dots, \mathbf{z}_t\} \cup \{\mathbf{o}_1, \dots, \mathbf{o}_{n-t}\}.$$

An element of $\mathcal{Y}$ is called a "point".

Given these contaminated observations, we want to recover the principal components of $A$, equivalently,
the top eigenvectors $\bar{\mathbf{w}}_1, \dots, \bar{\mathbf{w}}_d$ of $AA^\top$. That is, we seek a collection of orthogonal vectors
$\mathbf{w}_1, \dots, \mathbf{w}_d$ that maximize the performance metric called the Expressed Variance:

$$\mathrm{E.V.}(\mathbf{w}_1, \dots, \mathbf{w}_d) \triangleq \frac{\sum_{j=1}^{d} \mathbf{w}_j^\top AA^\top \mathbf{w}_j}{\sum_{j=1}^{d} \bar{\mathbf{w}}_j^\top AA^\top \bar{\mathbf{w}}_j} = \frac{\sum_{j=1}^{d} \mathbf{w}_j^\top AA^\top \mathbf{w}_j}{\mathrm{trace}(AA^\top)}.$$

The E.V. is always at most one, with equality achieved exactly when the vectors $\mathbf{w}_1, \dots, \mathbf{w}_d$ have the
span of the true principal components $\{\bar{\mathbf{w}}_1, \dots, \bar{\mathbf{w}}_d\}$. When $d = 1$, the Expressed Variance relates to
another natural performance metric, the angle between $\mathbf{w}_1$ and $\bar{\mathbf{w}}_1$, since by definition
$\mathrm{E.V.}(\mathbf{w}_1) = \cos^2(\angle(\mathbf{w}_1, \bar{\mathbf{w}}_1))$.^1 The Expressed Variance represents the portion of signal $A\mathbf{x}$ being expressed by
$\mathbf{w}_1, \dots, \mathbf{w}_d$. Equivalently, $1 - \mathrm{E.V.}$ is the reconstruction error of the signal.
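
As a small illustration of the metric, the sketch below evaluates the Expressed Variance under the assumption that it is the ratio $\sum_j \mathbf{w}_j^\top AA^\top \mathbf{w}_j / \mathrm{trace}(AA^\top)$. The function name `expressed_variance`, the random matrix `A`, and the sizes are hypothetical choices for the example.

```python
import numpy as np

def expressed_variance(W, A):
    """E.V. of the orthonormal columns of W (m x d) with respect to A (m x d)."""
    captured = np.sum((A.T @ W) ** 2)    # sum_j w_j^T A A^T w_j
    total = np.sum(A ** 2)               # trace(A A^T)
    return captured / total

rng = np.random.default_rng(2)
m, d = 30, 2
A = rng.standard_normal((m, d))

# The left singular vectors of A span the true principal subspace.
U, _, _ = np.linalg.svd(A, full_matrices=False)
ev_true = expressed_variance(U, A)

# A random orthonormal d-frame captures only part of the signal.
W_rand, _ = np.linalg.qr(rng.standard_normal((m, d)))
ev_rand = expressed_variance(W_rand, A)

print("E.V. of the true principal components:", ev_true)
print("E.V. of a random orthonormal frame:   ", ev_rand)
```

The true principal components attain the maximum value of one, while any other orthonormal frame scores strictly between zero and one in generic cases.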

It is natural to expect that the ability to recover vectors with a high expressed variance depends on
$\lambda$, the fraction of corrupted points; in addition, it depends on the distribution $\mu$ generating the
(low-dimensional) points $\mathbf{x}$, through its tails. If $\mu$ has longer tails, outliers that affect the variance (and hence
are far from the origin) and authentic samples in the tail of the distribution become more difficult to
distinguish. To quantify this effect, we define the following "tail weight" function $\mathcal{V} : [0, 1] \to [0, 1]$:

$$\mathcal{V}(\alpha) \triangleq \int_{-c_\alpha}^{c_\alpha} x^2 \, \bar{\mu}(dx),$$

where $\bar{\mu}$ is the one-dimensional margin of $\mu$ (recall that $\mu$ is spherically symmetric), and $c_\alpha$ is such that
$\bar{\mu}([-c_\alpha, c_\alpha]) = \alpha$. Since $\mu$ has a density function, $c_\alpha$ is well defined. Thus, $\mathcal{V}(\cdot)$ represents how the tail of
$\bar{\mu}$ contributes to its variance. Notice that $\mathcal{V}(0) = 0$, $\mathcal{V}(1) = 1$, and $\mathcal{V}(\cdot)$ is continuous since $\mu$ has a density
function. For notational convenience, we simply let $\mathcal{V}(x) = 0$ for $x < 0$, and $\mathcal{V}(x) = \infty$ for $x > 1$.

The bounds on the quality of recovery, given in Theorems 1 and 2 below, are functions of $\eta$ and the
function $\mathcal{V}(\cdot)$.

^1 This geometric interpretation does not extend to the case where $d > 1$, since the angle between two subspaces is not well defined.
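
For a concrete instance of the tail weight function, suppose the one-dimensional margin is standard normal (an assumption made only for this example). Integrating by parts gives the closed form $\mathcal{V}(\alpha) = \alpha - 2 c_\alpha \varphi(c_\alpha)$ with $c_\alpha = \Phi^{-1}((1+\alpha)/2)$, which the sketch below evaluates using only the Python standard library. The function name `tail_weight_gaussian` is hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def tail_weight_gaussian(alpha):
    """V(alpha) for a standard normal margin; 0 below 0 and infinity above 1."""
    if alpha <= 0:
        return 0.0
    if alpha > 1:
        return float("inf")
    p = (1.0 + alpha) / 2.0
    if p >= 1.0:
        return 1.0                      # the whole variance is captured
    c = _STD_NORMAL.inv_cdf(p)          # the interval [-c, c] has mass alpha
    # Integration by parts: int_{-c}^{c} x^2 phi(x) dx = alpha - 2 c phi(c).
    return alpha - 2.0 * c * _STD_NORMAL.pdf(c)

for a in (0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"V({a}) = {tail_weight_gaussian(a):.4f}")
```

The values stay well below $\alpha$ for moderate $\alpha$, reflecting that most of a Gaussian's variance comes from its tails.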

High Dimensional Setting and Asymptotic Scaling: In this paper, we focus on the case where
$n \sim m \gg d$ and $\mathrm{trace}(A^\top A) \gg 1$. That is, the number of observations and the dimensionality are of the same
magnitude, and much larger than the dimensionality of $\mathbf{x}$; the trace of $A^\top A$ is significantly larger than
1, but may be much smaller than $n$ and $m$. In our asymptotic scaling, $n$ and $m$ scale together to infinity,
while $d$ remains fixed. The value of $\sigma_1$ also scales to infinity, but there is no lower bound on the rate at
which this happens (and in particular, the scaling of $\sigma_1$ can be much slower than the scaling of $m$ and
$n$).

While we give finite-sample results, we are particularly interested in the asymptotic performance of
HR-PCA when the dimension and the number of observations grow together to infinity. Our asymptotic
setting is as follows. Suppose there exists a sequence of sample sets $\{\mathcal{Y}(j)\}_j = \{\mathcal{Y}(1), \mathcal{Y}(2), \dots\}$, where
for $\mathcal{Y}(j)$, $n(j)$, $m(j)$, $A(j)$, $d(j)$, etc., denote the corresponding values of the quantities defined above.
Then the following must hold for some positive constants $c_1, c_2$:

$$\lim_{j \to \infty} \frac{n(j)}{m(j)} = c_1; \quad d(j) \le c_2; \quad m(j) \uparrow +\infty; \quad \mathrm{trace}(A(j)^\top A(j)) \uparrow +\infty. \tag{1}$$

While $\mathrm{trace}(A(j)^\top A(j)) \uparrow +\infty$, if it scales more slowly than $\sqrt{m(j)}$, the SNR will asymptotically
decrease to zero.



B. Key Idea and Main Algorithm 

For $\mathbf{w} \in \mathcal{S}_m$, we define the Robust Variance Estimator (RVE) as
$\bar{V}_{\hat{t}}(\mathbf{w}) \triangleq \frac{1}{n} \sum_{i=1}^{\hat{t}} |\mathbf{w}^\top \mathbf{y}|^2_{(i)}$. This
stands for the following statistic: project $\mathbf{y}_i$ onto the direction $\mathbf{w}$, replace the furthest (from the origin)
$n - \hat{t}$ samples by 0, and then compute the variance. Notice that the RVE is always performed on the
original observed set $\mathcal{Y}$.
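
The estimator can be sketched in a few lines. The version below assumes the RVE is the sum of the `t_hat` smallest squared projections divided by `n`; the function name `robust_variance_estimator`, the data, and the outlier scale are illustrative assumptions.

```python
import numpy as np

def robust_variance_estimator(w, Y, t_hat):
    """(1/n) * sum of the t_hat smallest squared projections of the rows of Y onto w."""
    n = Y.shape[0]
    sq_proj = np.sort((Y @ w) ** 2)     # order statistics, non-decreasing
    return sq_proj[:t_hat].sum() / n

rng = np.random.default_rng(3)
Y = rng.standard_normal((100, 5))
Y[:10] *= 1000.0                        # ten gross outliers

w = np.eye(5)[0]                        # unit direction e_0
plain = np.mean((Y @ w) ** 2)           # ordinary (uncentered) variance along w
robust = robust_variance_estimator(w, Y, t_hat=90)

print("plain variance along w: ", plain)
print("robust variance along w:", robust)
```

Trimming the ten largest projections keeps the estimate near the variance of the authentic points, whereas the plain variance is dominated by the outliers.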

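The main algorithm of HR-PCA is as given below.

Algorithm 1: HR-PCA

Input: Contaminated sample-set $\mathcal{Y} = \{\mathbf{y}_1, \dots, \mathbf{y}_n\} \subset \mathbb{R}^m$, $d$, $T$, $\hat{t}$.

Output: $\mathbf{w}_1^*, \dots, \mathbf{w}_d^*$.

Algorithm:

1) Let $\hat{\mathbf{y}}_i := \mathbf{y}_i$ for $i = 1, \dots, n$; $s := 0$; Opt $:= 0$.

2) While $s \le T$, do

a) Compute the empirical variance matrix

$$\hat{\Sigma} := \frac{1}{n - s} \sum_{i=1}^{n-s} \hat{\mathbf{y}}_i \hat{\mathbf{y}}_i^\top.$$

b) Perform PCA on $\hat{\Sigma}$. Let $\mathbf{w}_1, \dots, \mathbf{w}_d$ be the $d$ principal components of $\hat{\Sigma}$.

c) If $\sum_{j=1}^{d} \bar{V}_{\hat{t}}(\mathbf{w}_j) >$ Opt, then let Opt $:= \sum_{j=1}^{d} \bar{V}_{\hat{t}}(\mathbf{w}_j)$ and let $\mathbf{w}_j^* := \mathbf{w}_j$ for
$j = 1, \dots, d$.

d) Randomly remove a point from $\{\hat{\mathbf{y}}_i\}_{i=1}^{n-s}$ according to

$$\Pr(\hat{\mathbf{y}}_i \text{ is removed}) \propto \sum_{j=1}^{d} (\mathbf{w}_j^\top \hat{\mathbf{y}}_i)^2;$$

e) Denote the remaining points by $\{\hat{\mathbf{y}}_i\}_{i=1}^{n-s-1}$;

f) $s := s + 1$.

3) Output $\mathbf{w}_1^*, \dots, \mathbf{w}_d^*$. End.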
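
The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation. The function names `rve` and `hr_pca`, the toy data, and the choice `T = 2 * n_out` are assumptions. The loop runs `T + 1` passes, which assumes the loop condition reads $s \le T$, and the running optimum starts at minus infinity rather than zero so that a candidate is always returned.

```python
import numpy as np

def rve(w, Y, t_hat):
    """Robust variance estimator, always evaluated on the full observed set Y."""
    return np.sort((Y @ w) ** 2)[:t_hat].sum() / Y.shape[0]

def hr_pca(Y, d, T, t_hat, rng):
    """Sketch of Algorithm 1; returns an (m x d) matrix of candidate directions."""
    remaining = Y.copy()
    best_score, best_W = -np.inf, None
    for _ in range(T + 1):
        if remaining.shape[0] == 0:
            break
        # a) empirical (uncentered) variance matrix of the remaining points
        cov = remaining.T @ remaining / remaining.shape[0]
        # b) top-d eigenvectors (eigh returns eigenvalues in ascending order)
        _, vecs = np.linalg.eigh(cov)
        W = vecs[:, -d:]
        # c) keep the candidate with the largest total RVE on the original data
        score = sum(rve(W[:, j], Y, t_hat) for j in range(d))
        if score > best_score:
            best_score, best_W = score, W
        # d)-e) remove one point with probability proportional to its projected energy
        energy = np.sum((remaining @ W) ** 2, axis=1)
        total = energy.sum()
        if total <= 0:
            break
        idx = rng.choice(remaining.shape[0], p=energy / total)
        remaining = np.delete(remaining, idx, axis=0)
    return best_W

rng = np.random.default_rng(4)
n, m, d, n_out = 120, 40, 1, 24               # corrupted fraction 0.2
t = n - n_out
a = np.zeros(m)
a[0] = 6.0                                    # rank-one A: signal along e_0
Z = rng.standard_normal((t, 1)) * a + rng.standard_normal((t, m))
O = np.zeros((n_out, m))
O[:, 1] = 1000.0                              # outliers aligned along e_1
Y = np.vstack([Z, O])

w_pca = np.linalg.eigh(Y.T @ Y / n)[1][:, -1]
w_hr = hr_pca(Y, d=d, T=2 * n_out, t_hat=t, rng=rng)[:, 0]

# For d = 1 the expressed variance is the squared cosine of the angle to e_0.
ev_pca, ev_hr = w_pca[0] ** 2, w_hr[0] ** 2
print("expressed variance, standard PCA:", ev_pca)
print("expressed variance, HR-PCA sketch:", ev_hr)
```

In this toy example standard PCA locks onto the outlier direction, while the sketch removes the aligned outliers within the first passes and then selects a candidate close to the true direction.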

Intuition on Why The Algorithm Works: On any given iteration, we select candidate directions based on
standard PCA; thus the directions chosen are those with largest empirical variance. Now, given a candidate
direction, $\mathbf{w}$, our robust variance estimator measures the variance of the $\hat{t}$ smallest points projected in
that direction. If this is large, it means that many of the points have a large variance in this direction: the
points contributing to the robust variance estimator, and the points that led to this direction being selected
by PCA. If the robust variance estimator is small, it is likely that a number of the largest variance points
are corrupted, and thus removing one of them randomly, in proportion to their distance in the direction
$\mathbf{w}$, is likely to remove a corrupted point.

Thus in summary, the algorithm works for the following intuitive reason. If the corrupted points have 
a very high variance along a direction with large angle from the span of the principal components, then 
with some probability, our algorithm removes them. If they have a high variance in a direction "close to" 
the span of the principal components, then this can only help in finding the principal components. Finally, 
if the corrupted points do not have a large variance, then the distortion they can cause in the output of 
PCA is necessarily limited. 

The remainder of the paper makes this intuition precise, providing lower bounds on the probability 
of removing corrupted points, and subsequently upper bounds on the maximum distortion the corrupted 
points can cause, i.e., lower bounds on the Expressed Variance of the principal components our algorithm 
recovers. 

There are two parameters to tune for HR-PCA, namely $\hat{t}$ and $T$. Basically, $\hat{t}$ affects the performance
of HR-PCA through Inequality (2), and as a rule of thumb we can set $\hat{t} = t$ when no a priori information
on $\mu$ exists. $T$ does not affect the performance as long as it is large enough, hence we can simply set
$T = n - 1$, although when $\lambda$ is small, a smaller $T$ leads to the same solution with less computational
cost.

The correctness of HR-PCA is shown in the following theorems for both the finite-sample bound, and 
the asymptotic performance. 

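Theorem 1 (Finite Sample Performance): Let the algorithm above output $\{\mathbf{w}_1, \dots, \mathbf{w}_d\}$. Fix a $\kappa > 0$,
and let $\tau = \max(m/n, 1)$. There exists a universal constant $c_0$ and a constant $C$ which can possibly
depend on $\hat{t}/t$, $\lambda$, $d$, $\mu$ and $\kappa$, such that for any $\gamma < 1$, if $n/\log^4 n \ge \log^6(1/\gamma)$, then with probability
$1 - \gamma$ the following holds:

$$
\begin{aligned}
\mathrm{E.V.}\{\mathbf{w}_1, \dots, \mathbf{w}_d\} \ge{} & \left[ \frac{\mathcal{V}\left(1 - \frac{\lambda(1+\kappa)}{(1-\lambda)\kappa}\right)}{1+\kappa} \right] \times \left[ \frac{\mathcal{V}\left(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\right)}{\mathcal{V}\left(\frac{\hat{t}}{t}\right)} \right] \\
& - \left[ \frac{8\sqrt{c_0 \tau d}}{\mathcal{V}\left(\frac{\hat{t}}{t}\right)} \right] (\mathrm{trace}(AA^\top))^{-1/2} - \left[ \frac{2 c_0 \tau}{\mathcal{V}\left(\frac{\hat{t}}{t}\right)} \right] (\mathrm{trace}(AA^\top))^{-1} - C \, \frac{\log^2 n \, \log^3(1/\gamma)}{\sqrt{n}}.
\end{aligned}
$$

The last three terms go to zero as the dimension and number of points scale to infinity, i.e., as $n, m \to \infty$.
Therefore, we immediately obtain: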

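Theorem 2 (Asymptotic Performance): Given a sequence of $\{\mathcal{Y}(j)\}$, if the asymptotic scaling in
Expression (1) holds, and $\limsup \lambda(j) \le \lambda^*$, then the following holds in probability when $j \uparrow \infty$ (i.e., when
$n, m \uparrow \infty$):

$$\liminf_{j} \mathrm{E.V.}\{\mathbf{w}_1(j), \dots, \mathbf{w}_d(j)\} \ge \max_{\kappa} \left[ \frac{\mathcal{V}\left(1 - \frac{\lambda^*(1+\kappa)}{(1-\lambda^*)\kappa}\right)}{1+\kappa} \right] \times \left[ \frac{\mathcal{V}\left(\frac{\hat{t}}{t} - \frac{\lambda^*}{1-\lambda^*}\right)}{\mathcal{V}\left(\frac{\hat{t}}{t}\right)} \right]. \tag{2}$$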

Remark 1: The bounds in the two bracketed terms in the asymptotic bound may be, roughly, explained
as follows. The first term is due to the fact that the removal procedure may well not remove all large-magnitude
corrupted points, while at the same time, some authentic points may be removed. The second
term accounts for the fact that not all the outliers may have large magnitude. These will likely not be
removed, and will have some (small) effect on the principal component directions reported in the output.

Remark 2: The terms in the second line of the bound in Theorem 1 go to zero as $n, m \to \infty$, and therefore
proving Theorem 1 immediately implies Theorem 2.

Remark 3: If $\lambda(j) \downarrow 0$, i.e., the number of corrupted points scales sublinearly (in particular, this holds
when there are a fixed number of corrupted points), then the right-hand side of Inequality (2) equals 1,^2
i.e., HR-PCA is asymptotically optimal. This is in contrast to PCA, where the existence of even a single
corrupted point is sufficient to bound the output arbitrarily away from the optimum.

Remark 4: The breakdown point of HR-PCA converges to 50%. Note that since $\mu$ has a density function,
$\mathcal{V}(\alpha) > 0$ for any $\alpha \in (0, 1]$. Therefore, for any $\lambda < 1/2$, if we set $\hat{t}$ to any value in $(\lambda n, t]$, then there
exists $\kappa$ large enough such that the right-hand side is strictly positive (recall that $t = (1 - \lambda)n$). The
breakdown point hence converges to 50%. Thus, HR-PCA achieves the maximal possible breakdown
point (note that a breakdown point greater than 50% is never possible, since then there are more outliers
than samples).
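
The asymptotic bound can be evaluated numerically once a tail weight function is fixed. The sketch below assumes a standard normal margin, for which $\mathcal{V}(\alpha) = \alpha - 2c_\alpha\varphi(c_\alpha)$, and assumes the right-hand side of Inequality (2) is the product of $\mathcal{V}(1 - \lambda^*(1+\kappa)/((1-\lambda^*)\kappa))/(1+\kappa)$ and $\mathcal{V}(\hat{t}/t - \lambda^*/(1-\lambda^*))/\mathcal{V}(\hat{t}/t)$, maximized over $\kappa$. The maximization is replaced by a grid search, and the names `V_gauss` and `asymptotic_bound` are hypothetical.

```python
from statistics import NormalDist

_PHI = NormalDist()

def V_gauss(alpha):
    """Tail weight of a standard normal margin (0 below 0, infinity above 1)."""
    if alpha <= 0:
        return 0.0
    if alpha > 1:
        return float("inf")
    p = (1.0 + alpha) / 2.0
    if p >= 1.0:
        return 1.0
    c = _PHI.inv_cdf(p)
    return alpha - 2.0 * c * _PHI.pdf(c)

def asymptotic_bound(lam, ratio=1.0, kappas=None):
    """Grid-search the assumed right-hand side of Inequality (2); ratio = t_hat / t in (0, 1]."""
    if kappas is None:
        kappas = [10 ** (e / 20) for e in range(-80, 81)]   # 1e-4 ... 1e4
    second = V_gauss(ratio - lam / (1.0 - lam)) / V_gauss(ratio)
    best = 0.0
    for k in kappas:
        first = V_gauss(1.0 - lam * (1.0 + k) / ((1.0 - lam) * k)) / (1.0 + k)
        best = max(best, first * second)
    return best

for lam in (0.0, 0.01, 0.05, 0.2, 0.45, 0.6):
    print(f"lambda* = {lam:4.2f}  lower bound = {asymptotic_bound(lam):.4f}")
```

Consistent with Remarks 3 and 4, the computed bound approaches one as the corrupted fraction goes to zero, stays positive below one half, and is zero above it.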

The graphs in Figure 1 illustrate the lower bounds of asymptotic performance if the one-dimensional
marginal of $\mu$ is the Gaussian distribution or the Uniform distribution.

Fig. 1. Lower Bounds of Asymptotic Performance (Lower Bound of Expressed Variance). (a) Gaussian distribution. (b) Uniform distribution.

^2 We can take $\kappa(j) = \sqrt{\lambda(j)}$ and note that since $\mu$ has a density, $\mathcal{V}(\cdot)$ is continuous.

IV. Kernelization

We consider kernelizing HR-PCA in this section: given a feature mapping $\Upsilon(\cdot) : \mathbb{R}^m \to \mathcal{H}$ equipped
with a kernel function $k(\cdot, \cdot)$, i.e., $\langle \Upsilon(\mathbf{a}), \Upsilon(\mathbf{b}) \rangle = k(\mathbf{a}, \mathbf{b})$ holds for all $\mathbf{a}, \mathbf{b} \in \mathbb{R}^m$, we perform the
dimensionality reduction in the feature space $\mathcal{H}$ without knowing the explicit form of $\Upsilon(\cdot)$.

We assume that $\{\Upsilon(\mathbf{y}_1), \dots, \Upsilon(\mathbf{y}_n)\}$ is centered at the origin without loss of generality, since we can
center any $\Upsilon(\cdot)$ with the following feature mapping

$$\hat{\Upsilon}(\mathbf{x}) \triangleq \Upsilon(\mathbf{x}) - \frac{1}{n} \sum_{i=1}^{n} \Upsilon(\mathbf{y}_i),$$

whose kernel function is

$$\hat{k}(\mathbf{a}, \mathbf{b}) = k(\mathbf{a}, \mathbf{b}) - \frac{1}{n} \sum_{j=1}^{n} k(\mathbf{a}, \mathbf{y}_j) - \frac{1}{n} \sum_{i=1}^{n} k(\mathbf{y}_i, \mathbf{b}) + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} k(\mathbf{y}_i, \mathbf{y}_j).$$
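
On the sample points themselves, the centered kernel amounts to a simple transformation of the Gram matrix. The sketch below assumes the usual form (subtract row and column means, add back the grand mean) and checks it against explicit centering for a linear kernel, where the feature map is the identity. The function name `center_gram` and the data are illustrative.

```python
import numpy as np

def center_gram(K):
    """Gram matrix of the centered feature map, computed from the uncentered Gram matrix K."""
    col_mean = K.mean(axis=1, keepdims=True)   # (1/n) sum_j k(a, y_j), one entry per row a
    row_mean = K.mean(axis=0, keepdims=True)   # (1/n) sum_i k(y_i, b), one entry per column b
    return K - col_mean - row_mean + K.mean()

rng = np.random.default_rng(5)
Y = rng.standard_normal((8, 3)) + 2.0          # deliberately off-center points
K = Y @ Y.T                                    # linear kernel

Yc = Y - Y.mean(axis=0)                        # explicit centering in feature space
max_gap = np.abs(center_gram(K) - Yc @ Yc.T).max()
print("largest discrepancy between the two centerings:", max_gap)
```

For a nonlinear kernel the explicit centering is unavailable, which is why the Gram-matrix form is the one used in practice.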

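Notice that HR-PCA involves finding a set of PCs $\mathbf{w}_1, \dots, \mathbf{w}_d \in \mathcal{H}$, and evaluating $\langle \mathbf{w}_q, \Upsilon(\cdot) \rangle$ (note
that RVE is a function of $\langle \mathbf{w}_q, \Upsilon(\mathbf{y}_i) \rangle$, and random removal depends on $\langle \mathbf{w}_q, \Upsilon(\hat{\mathbf{y}}_i) \rangle$). The former can
be kernelized by applying Kernel PCA introduced by [36], where each of the output PCs admits a
representation

$$\mathbf{w}_q = \sum_{j=1}^{n-s} \alpha_j(q) \Upsilon(\hat{\mathbf{y}}_j).$$

Thus, $\langle \mathbf{w}_q, \Upsilon(\cdot) \rangle$ is easily evaluated by

$$\langle \mathbf{w}_q, \Upsilon(\mathbf{v}) \rangle = \sum_{j=1}^{n-s} \alpha_j(q) k(\hat{\mathbf{y}}_j, \mathbf{v}); \quad \forall \mathbf{v} \in \mathbb{R}^m.$$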
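Therefore, HR-PCA is kernelizable since both steps are easily kernelized and we have the following
Kernel HR-PCA.

Algorithm 2: Kernel HR-PCA

Input: Contaminated sample-set $\mathcal{Y} = \{\mathbf{y}_1, \dots, \mathbf{y}_n\} \subset \mathbb{R}^m$, $d$, $T$, $\hat{t}$.

Output: $\alpha^*(1), \dots, \alpha^*(d)$.

Algorithm:

1) Let $\hat{\mathbf{y}}_i := \mathbf{y}_i$ for $i = 1, \dots, n$; $s := 0$; Opt $:= 0$.

2) While $s \le T$, do

a) Compute the Gram matrix of $\{\hat{\mathbf{y}}_i\}$:

$$K_{ij} := k(\hat{\mathbf{y}}_i, \hat{\mathbf{y}}_j); \quad i, j = 1, \dots, n - s.$$

b) Let $\hat{\sigma}_1^2, \dots, \hat{\sigma}_d^2$ and $\hat{\alpha}(1), \dots, \hat{\alpha}(d)$ be the $d$ largest eigenvalues and the
corresponding eigenvectors of $K$.

c) Normalize: $\alpha(q) := \hat{\alpha}(q)/\hat{\sigma}_q$, so that $\|\mathbf{w}_q\| = 1$.

d) If $\sum_{q=1}^{d} \bar{V}_{\hat{t}}(\alpha(q)) >$ Opt, then let Opt $:= \sum_{q=1}^{d} \bar{V}_{\hat{t}}(\alpha(q))$ and let $\alpha^*(q) := \alpha(q)$
for $q = 1, \dots, d$.

e) Randomly remove a point from $\{\hat{\mathbf{y}}_i\}_{i=1}^{n-s}$ according to

$$\Pr(\hat{\mathbf{y}}_i \text{ is removed}) \propto \sum_{q=1}^{d} \left( \sum_{j=1}^{n-s} \alpha_j(q) k(\hat{\mathbf{y}}_j, \hat{\mathbf{y}}_i) \right)^2;$$

f) Denote the remaining points by $\{\hat{\mathbf{y}}_i\}_{i=1}^{n-s-1}$;

g) $s := s + 1$.

3) Output $\alpha^*(1), \dots, \alpha^*(d)$. End.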



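Here, the kernelized RVE is defined as

$$\bar{V}_{\hat{t}}(\alpha) \triangleq \frac{1}{n} \sum_{i=1}^{\hat{t}} \left[ \left\langle \sum_{j=1}^{n-s} \alpha_j \Upsilon(\hat{\mathbf{y}}_j), \Upsilon(\mathbf{y}) \right\rangle \right]^2_{(i)} = \frac{1}{n} \sum_{i=1}^{\hat{t}} \left[ \left| \sum_{j=1}^{n-s} \alpha_j k(\hat{\mathbf{y}}_j, \mathbf{y}) \right|^2 \right]_{(i)}.$$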



V. Proof of the Main Result 

In this section we provide the main steps of the proof of the finite-sample and asymptotic performance 
bounds, including the precise statements and the key ideas in the proof, but deferring some of the more 
standard or tedious elements to the appendix. The proof consists of three steps which we now outline. In 
what follows, we let $d$, $m/n$, $\lambda$, $\hat{t}/t$, and $\mu$ be fixed. We can fix a $\lambda \in (0, 0.5)$ without loss of generality,
due to the fact that if a result is shown to hold for $\lambda$, then it holds for $\lambda' < \lambda$. The letter $c$ is used to
represent a constant, and $\epsilon$ is a constant that decreases to zero as $n$ and $m$ increase to infinity. The values
of $c$ and $\epsilon$ can change from line to line, and can possibly depend on $d$, $m/n$, $\lambda$, $\hat{t}/t$, and $\mu$.

1) The blessing of dimensionality, and laws of large numbers: The first step involves two ideas; the first 
is the (well-known, e.g., [37]) fact that even as n and m scale, the expectation of the covariance of 
the noise is bounded independently of m. The second involves appealing to laws of large numbers 
to show that sample estimates of the covariance of the noise, n, of the signal, x, and then of 
the authentic points, z = Ax + n, are uniformly close to their expectation, with high probability. 
Specifically, we prove: 

a) With high probability, the largest eigenvalue of the variance of noise matrix is bounded. That
is,

$$\sup_{w \in \mathcal{S}_m} \frac{1}{t} \sum_{i=1}^{t} (w^\top n_i)^2 \le c.$$
b) With high probability, both the largest and the smallest eigenvalue of the signals in the original
space converge to 1. That is,

$$\sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{t} (w^\top x_i)^2 - 1 \Big| \le \epsilon.$$



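c) Under 1b, with high probability, RVE is a valid variance estimator for the $d$-dimensional
signals. That is,

$$\sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{\hat{t}} |w^\top x|^2_{(i)} - \mathcal{V}\Big(\frac{\hat{t}}{t}\Big) \Big| \le \epsilon.$$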



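d) Under 1a and 1c, RVE is a valid estimator of the variance of the authentic samples. That is,
the following holds uniformly over all $w \in \mathcal{S}_m$,

$$(1 - \epsilon)\|w^\top A\|^2 \mathcal{V}\Big(\frac{\hat{t}}{t}\Big) - c\|w^\top A\| \le \frac{1}{t} \sum_{i=1}^{\hat{t}} |w^\top z|^2_{(i)} \le (1 + \epsilon)\|w^\top A\|^2 \mathcal{V}\Big(\frac{\hat{t}}{t}\Big) + c\|w^\top A\|.$$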

2) The next step shows that with high probability, the algorithm finds a "good" solution within a 
bounded number of steps. In particular, this involves showing that if in a given step the algorithm 
has not found a good solution, in the sense that the variance along a principal component is not 
mainly due to the authentic points, then the random removal scheme removes a corrupted point 
with probability bounded away from zero. We then use martingale arguments to show that as a
consequence of this, there cannot be many steps without the algorithm finding at least one "good"
solution, since in the absence of good solutions, most of the corrupted points are removed by the
algorithm.

3) The previous step shows the existence of a "good" solution. The final step shows two things: first, 
that this good solution has performance that is close to that of the optimal solution, and second, 
that the final output of the algorithm is close to that of the "good" solution. Combining these two 
steps, we derive the finite-sample and asymptotic performance bounds for HR-PCA. 

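A. Step 1a

Theorem 3: Let $\tau = \max(m/n, 1)$. There exist universal constants $c$ and $c'$ such that for any $\gamma > 0$,
with probability at least $1 - \gamma$, the following holds:

$$\sup_{w \in \mathcal{S}_m} \frac{1}{t} \sum_{i=1}^{t} (w^\top n_i)^2 \le c\tau + \frac{c' \log(1/\gamma)}{n}.$$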

Proof: The proof of the theorem depends on the following lemma, that is essentially Theorem II.13
in [37].

Lemma 1: Let $\Gamma$ be an $n \times p$ matrix with $n \le p$, whose entries are all i.i.d. $\mathcal{N}(0, 1)$ Gaussian variables.
Let $s_1(\Gamma)$ be the largest singular value of $\Gamma$; then

$$\Pr\big(s_1(\Gamma) > \sqrt{n} + \sqrt{p} + \sqrt{p}\,\epsilon\big) \le \exp(-p\epsilon^2/2).$$

Our result now follows, since $\sup_{w \in \mathcal{S}_m} \frac{1}{t} \sum_{i=1}^{t} (w^\top n_i)^2$ is the largest eigenvalue of $W = (1/t)\Gamma_1 \Gamma_1^\top$, where
$\Gamma_1$ is an $m \times t$ matrix whose entries are all i.i.d. $\mathcal{N}(0, 1)$ Gaussian variables; and, moreover, the largest
eigenvalue of $W$ is given by $\lambda_W = [s_1(\Gamma_1)]^2/t$. Specifically, we have

Pr{Xw > 7^ TT -) 

^ (1 — A)n ' 



, m + t + max(m, t)e^ + 2\fmt + 2-v/(m + t) maxfm, t)e^ 
<Pr(AvK > — -^-^) 



Pr (si(F) > \frfi + \ft + \/ max(m, t)e) < exp(— max(m, t)e^ /2) < exp(— (1 — A)nre^/2). 
Letting $\gamma$ equal the r.h.s. and noting that $\lambda < 1/2$, we have that



sup - > (w^n,)2 < 8r + 8\/ ^ + 

The theorem follows by letting $c = 16$ and $c' = 16$.
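As a quick numerical sanity check of the mechanism behind Step 1a, the sketch below draws an $m \times t$ standard Gaussian noise matrix, computes $\sup_w \frac{1}{t}\sum_i (w^\top n_i)^2$ as the squared largest singular value divided by $t$, and compares it with the threshold $(\sqrt{m} + \sqrt{t} + \sqrt{\max(m,t)}\,\epsilon)^2/t$ suggested by Lemma 1. The dimensions, the seed and the choice $\epsilon = 1$ are arbitrary illustrative assumptions, and the helper name is hypothetical.

```python
import numpy as np

def noise_variance_sup(noise):
    """sup over unit vectors w of (1/t) * sum_i (w^T n_i)^2 for a noise matrix of shape (m, t)."""
    t = noise.shape[1]
    s1 = np.linalg.norm(noise, 2)          # largest singular value of the noise matrix
    return s1 ** 2 / t

rng = np.random.default_rng(0)
m, t, eps = 400, 300, 1.0                  # illustrative sizes, not taken from the paper
noise = rng.standard_normal((m, t))
observed = noise_variance_sup(noise)
threshold = (np.sqrt(m) + np.sqrt(t) + np.sqrt(max(m, t)) * eps) ** 2 / t
print(f"observed sup = {observed:.3f}, Lemma 1 threshold = {threshold:.3f}")
```

The observed value stays of order $\max(m, t)/t$ regardless of how large $m$ is, which is the "bounded independently of $m$" behaviour that the theorem formalizes.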



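B. Step 1b

Theorem 4: There exists a constant $c$ that only depends on $\mu$ and $d$, such that for any $\gamma > 0$, with
probability at least $1 - \gamma$,

$$\sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{t} (w^\top x_i)^2 - 1 \Big| \le \frac{c \log^2 n \, \log^3(1/\gamma)}{\sqrt{n}}.$$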



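Proof: The proof of Theorem 4 depends on the following matrix concentration inequality from [38].

Theorem 5: There exists an absolute constant $c_0$ for which the following holds. Let $X$ be a random
vector in $\mathbb{R}^n$, and set $Z = \|X\|$. If $X$ satisfies

1) There is some $\rho > 0$ such that $\sup_{w \in \mathcal{S}_n} \big(E(w^\top X)^4\big)^{1/4} \le \rho$,

2) $\|Z\|_{\psi_\alpha} < \infty$ for some $\alpha \ge 1$,

then for any $\epsilon > 0$

$$\Pr\Big( \Big\| \frac{1}{N} \sum_{i=1}^{N} X_i X_i^\top - E(XX^\top) \Big\| \ge \epsilon \Big) \le \exp\Big( -\Big( \frac{c_0 \epsilon}{\max(B_{d,N}, A_{d,N}^2)} \Big)^{\beta} \Big),$$

where $X_i$ are i.i.d. copies of $X$, $d = \min(n, N)$, $\beta = (1 + 2/\alpha)^{-1}$ and

$$A_{d,N} = \|Z\|_{\psi_\alpha} \frac{\sqrt{\log d}\,(\log N)^{1/\alpha}}{\sqrt{N}}, \qquad B_{d,N} = \frac{\rho^2}{\sqrt{N}} + \|E(XX^\top)\|^{1/2} A_{d,N}.$$

We apply Theorem 5 by observing that

$$\sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{t} (w^\top x_i)^2 - 1 \Big| = \sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{t} w^\top x_i x_i^\top w - w^\top E(xx^\top) w \Big| \le \Big\| \frac{1}{t} \sum_{i=1}^{t} x_i x_i^\top - E(xx^\top) \Big\|.$$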
One must still check that both conditions in Theorem 5 are satisfied by $x$. The first condition is satisfied
because $\sup_{w \in \mathcal{S}_d} E(w^\top x)^4 \le E\|x\|^4 < \infty$, where the second inequality follows from the assumption that
$\|x\|$ has an exponential decay which guarantees the existence of all moments. The second condition is
satisfied thanks to Lemma 2.2.1 of [39]. ■
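C. Step 1c

Theorem 6: Fix $\eta < 1$. There exists a constant $c$ that depends on $d$, $\mu$, and $\eta$, such that for all $\gamma < 1$ and
$\hat{t}/t < \eta$, the following holds with probability at least $1 - \gamma$:

$$\sup_{w \in \mathcal{S}_d,\ \bar{t} \le \hat{t}} \Big| \frac{1}{t} \sum_{i=1}^{\bar{t}} |w^\top x|^2_{(i)} - \mathcal{V}\Big(\frac{\bar{t}}{t}\Big) \Big| \le c \sqrt{\frac{\log n + \log(1/\gamma)}{n}} + c\, \frac{\log^{5/2} n \, \log^{7/2}(1/\gamma)}{n}.$$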

We first prove a one-dimensional version of this result, and then use this to prove the general case. We
show that if the empirical mean is bounded, then the truncated mean converges to its expectation, and
more importantly, the convergence rate is distribution free. Since this is a general result, we abuse the
notation $\mu$ and $m$.

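Lemma 2: Given $\delta \in [0, 1]$, $c \in \mathbb{R}^+$, and $\hat{m}, m \in \mathbb{N}$ satisfying $\hat{m} < m$, let $a_1, \dots, a_m$ be i.i.d. samples
drawn from a probability measure $\mu$ that is supported on $\mathbb{R}^+$ and has a density function. Assume that $E(a) = 1$
and $\frac{1}{m} \sum_{i=1}^{m} a_i \le 1 + c$. Then with probability at least $1 - \delta$ we have

$$\sup_{\bar{m} \le \hat{m}} \Big| \frac{1}{m} \sum_{i=1}^{\bar{m}} a_{(i)} - \int_0^{\mu^{-1}(\bar{m}/m)} a \, d\mu \Big| \le \frac{(2 + c)m}{m - \hat{m}} \sqrt{\frac{8(2 \log m + 1 + \log\frac{8}{\delta})}{m}},$$

where $\mu^{-1}(x) = \min\{z \mid \mu(a \le z) \ge x\}$.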

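Proof: To avoid heavy notation, let $\epsilon_0 = \sqrt{8(2 \log m + 1 + \log\frac{8}{\delta})/m}$ and $\epsilon = \frac{(1 + c)m}{m - \hat{m}} \epsilon_0$. The key to obtaining
uniform convergence in this proof relies on a standard Vapnik-Chervonenkis (VC) dimension argument.
Consider two classes of functions $\mathcal{F} = \{f_e(\cdot) : \mathbb{R}^+ \to \mathbb{R}^+ \mid e \in \mathbb{R}^+\}$ and $\mathcal{G} = \{g_e(\cdot) : \mathbb{R}^+ \to \{0, +1\} \mid e \in \mathbb{R}^+\}$,
defined as $f_e(a) = a \cdot \mathbf{1}(a \le e)$ and $g_e(a) = \mathbf{1}(a \le e)$. Note that for any $e_1 \ge e_2$, the subgraphs of $f_{e_2}$ and $g_{e_2}$
are contained in the subgraphs of $f_{e_1}$ and $g_{e_1}$ respectively, which guarantees that $VC(\mathcal{F}) = VC(\mathcal{G}) \le 2$
(cf. page 146 of [39]). Since $g_e(\cdot)$ is bounded in $[0, 1]$ and $f_e(\cdot)$ is bounded in $[0, e]$, standard VC-based
uniform-convergence analysis yields

$$\Pr\Big( \sup_{e \ge 0} \Big| \frac{1}{m} \sum_{i=1}^{m} g_e(a_i) - E g_e(a) \Big| \ge \epsilon_0 \Big) \le 4 \exp\big(2 \log m + 1 - m\epsilon_0^2/8\big) = \frac{\delta}{2}$$

and

$$\Pr\Big( \sup_{e \in [0, (1+c)m/(m-\hat{m})]} \Big| \frac{1}{m} \sum_{i=1}^{m} f_e(a_i) - E f_e(a) \Big| \ge \epsilon \Big) \le 4 \exp\Big(2 \log m + 1 - \frac{\epsilon^2 (m - \hat{m})^2}{8(1 + c)^2 m}\Big) = \frac{\delta}{2}.$$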



With some additional work (see the appendix for the full details) these inequalities provide the one- 
dimensional result of the lemma. ■ 
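To make the one-dimensional statement concrete, the following sketch checks it by simulation for one specific distribution. It assumes $a \sim \mathrm{Exp}(1)$ (so $E(a) = 1$), for which the truncated integral has the closed form $\int_0^{\mu^{-1}(x)} a\, d\mu = x + (1 - x)\log(1 - x)$; the sample size, $\hat{m}$, $\delta$ and the function names are illustrative choices and do not come from the paper.

```python
import numpy as np

def truncated_mean_gap(samples, m_hat):
    """sup over m_bar <= m_hat of |(1/m) sum_{i<=m_bar} a_(i) - h(m_bar/m)| for Exp(1) samples."""
    m = samples.size
    a_sorted = np.sort(samples)
    partial = np.cumsum(a_sorted)[:m_hat] / m      # (1/m) * sum of the m_bar smallest samples
    x = np.arange(1, m_hat + 1) / m
    h = x + (1.0 - x) * np.log1p(-x)               # closed form of the truncated mean for Exp(1)
    return np.max(np.abs(partial - h))

def lemma2_bound(m, m_hat, c, delta):
    """Right-hand side of the distribution-free bound in Lemma 2."""
    return (2 + c) * m / (m - m_hat) * np.sqrt(8 * (2 * np.log(m) + 1 + np.log(8 / delta)) / m)

rng = np.random.default_rng(1)
m, m_hat, delta = 20000, 15000, 0.05
a = rng.exponential(1.0, size=m)
c = max(a.mean() - 1.0, 0.0)                       # smallest c with empirical mean <= 1 + c
gap = truncated_mean_gap(a, m_hat)
bound = lemma2_bound(m, m_hat, c, delta)
print(f"observed gap = {gap:.4f}, Lemma 2 bound = {bound:.4f}")
```

In this setting the observed gap is far below the bound, which is expected: the bound is distribution free and therefore loose for any particular well-behaved distribution.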
Next, en route to proving the main result, we prove a uniform multi-dimensional version of the previous 
lemma. 

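Theorem 7: If $\sup_{w \in \mathcal{S}_d} \big| \frac{1}{t} \sum_{i=1}^{t} (w^\top x_i)^2 - 1 \big| \le c$, then

$$\Pr\Big\{ \sup_{w \in \mathcal{S}_d,\ \bar{t} \le \hat{t}} \Big| \frac{1}{t} \sum_{i=1}^{\bar{t}} |w^\top x|^2_{(i)} - \mathcal{V}\Big(\frac{\bar{t}}{t}\Big) \Big| \ge \epsilon \Big\} \le \max\Big[ \frac{8 e t^2 6^d (1 + c)^{d/2}}{\epsilon^{d/2}},\ \frac{8 e t^2 24^d (1 + c)^d t^{d/2}}{\epsilon^d (t - \hat{t})^{d/2}} \Big] \exp\Big( -\frac{\epsilon^2 (1 - \hat{t}/t)^2 t}{32 (2 + c)^2} \Big).$$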

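Proof: To avoid heavy notation, let $\delta_1 = \sqrt{\epsilon/(4 + 4c)}$, $\delta_2 = \epsilon\sqrt{t - \hat{t}}/((8 + 8c)\sqrt{t})$, and $\delta =
\min(\delta_1, \delta_2)$.

It is well known (cf. Chapter 13 of [40]) that we can construct a finite set $\hat{\mathcal{S}}_d \subset \mathcal{S}_d$ such that $|\hat{\mathcal{S}}_d| \le
(3/\delta)^d$, and $\max_{w \in \mathcal{S}_d} \min_{w_1 \in \hat{\mathcal{S}}_d} \|w - w_1\| \le \delta$. For a fixed $w_1 \in \hat{\mathcal{S}}_d$, note that $(w_1^\top x_1)^2, \dots, (w_1^\top x_t)^2$ are
i.i.d. samples of a non-negative random variable satisfying the conditions of Lemma 2. Thus by Lemma 2
we have

$$\Pr\Big\{ \sup_{\bar{t} \le \hat{t}} \Big| \frac{1}{t} \sum_{i=1}^{\bar{t}} |w_1^\top x|^2_{(i)} - \mathcal{V}\Big(\frac{\bar{t}}{t}\Big) \Big| > \epsilon/2 \Big\} \le 8 e t^2 \exp\Big( -\frac{(1 - \hat{t}/t)^2 \epsilon^2 t}{32 (2 + c)^2} \Big).$$

Thus by the union bound we have

$$\Pr\Big\{ \sup_{w \in \hat{\mathcal{S}}_d,\ \bar{t} \le \hat{t}} \Big| \frac{1}{t} \sum_{i=1}^{\bar{t}} |w^\top x|^2_{(i)} - \mathcal{V}\Big(\frac{\bar{t}}{t}\Big) \Big| > \epsilon/2 \Big\} \le \frac{8 e t^2 3^d}{\delta^d} \exp\Big( -\frac{(1 - \hat{t}/t)^2 \epsilon^2 t}{32 (2 + c)^2} \Big).$$

Next, we need to relate the uniform bound on $\mathcal{S}_d$ with the uniform bound on this finite set. This requires
a number of steps, all of which we postpone to the appendix. ■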
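Corollary 1: If $\sup_{w \in \mathcal{S}_d} \big| \frac{1}{t} \sum_{i=1}^{t} (w^\top x_i)^2 - 1 \big| \le c$, then with probability $1 - \gamma$

$$\sup_{w \in \mathcal{S}_d,\ \bar{t} \le \hat{t}} \Big| \frac{1}{t} \sum_{i=1}^{\bar{t}} |w^\top x|^2_{(i)} - \mathcal{V}\Big(\frac{\bar{t}}{t}\Big) \Big| \le \epsilon_0,$$

where

$$\epsilon_0 = \sqrt{\frac{32(2 + c)^2 \max\big[\frac{d+4}{2} \log t + \log\frac{1}{\gamma} + \log(16 e 6^d) + \frac{d}{2} \log(1 + c),\ (1 - \hat{t}/t)^2\big]}{t (1 - \hat{t}/t)^2}}$$

$$\quad + \sqrt{\frac{32(2 + c)^2 \max\big[(d + 2) \log t + \log\frac{1}{\gamma} + \log(16 e 24^d) + d \log(1 + c) - \frac{d}{2} \log(1 - \hat{t}/t),\ (1 - \hat{t}/t)^2\big]}{t (1 - \hat{t}/t)^2}}.$$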

Proof: The proof follows from Theorem 7 and from the following lemma, whose proof we leave to
the appendix.

Lemma 3: For any $C_1, C_2, d', t \ge 0$, and $0 < \gamma < 1$, let

$$\epsilon = \sqrt{\frac{\max(d' \log t - \log(\gamma/C_1),\ C_2)}{t C_2}},$$

then

$$C_1 \epsilon^{-d'} \exp(-C_2 \epsilon^2 t) \le \gamma.$$
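Lemma 3 is easy to check numerically. The sketch below takes $\epsilon = \sqrt{\max(d' \log t - \log(\gamma/C_1),\, C_2)/(t C_2)}$ and evaluates $C_1 \epsilon^{-d'} \exp(-C_2 \epsilon^2 t)$ for a few parameter settings with $t \ge 1$; the settings and function names are arbitrary illustrative choices.

```python
import math

def lemma3_epsilon(C1, C2, d_prime, t, gamma):
    """The choice of epsilon in Lemma 3."""
    big = max(d_prime * math.log(t) - math.log(gamma / C1), C2)
    return math.sqrt(big / (t * C2))

def lemma3_lhs(C1, C2, d_prime, t, eps):
    """Left-hand side C1 * eps^(-d') * exp(-C2 * eps^2 * t)."""
    return C1 * eps ** (-d_prime) * math.exp(-C2 * eps ** 2 * t)

settings = [
    (50.0, 0.01, 1.5, 1000, 0.05),
    (1e6, 0.5, 3.0, 200, 0.01),
    (2.0, 5.0, 1.0, 10, 0.9),
]
for C1, C2, d_prime, t, gamma in settings:
    eps = lemma3_epsilon(C1, C2, d_prime, t, gamma)
    print(f"eps = {eps:.4f}, lhs = {lemma3_lhs(C1, C2, d_prime, t, eps):.3e}, gamma = {gamma}")
```

In each case the printed left-hand side should not exceed $\gamma$, which is what allows the tail bound of Theorem 7 to be inverted into the explicit $\epsilon_0$ of Corollary 1.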



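Now we prove Theorem 6, which is the main result of this section.

Proof: By Corollary 1, there exists a constant $c'$ which only depends on $d$, such that if

$$\sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{t} (w^\top x_i)^2 - 1 \Big| \le c,$$

then with probability $1 - \gamma/2$

$$\sup_{w \in \mathcal{S}_d,\ \bar{t} \le \hat{t}} \Big| \frac{1}{t} \sum_{i=1}^{\bar{t}} |w^\top x|^2_{(i)} - \mathcal{V}\Big(\frac{\bar{t}}{t}\Big) \Big| \le c'(2 + c) \sqrt{\frac{\log t + \log(1/\gamma) + \log(1 + c) - \log(1 - \hat{t}/t)}{t (1 - \hat{t}/t)^2}}.$$

Now apply Theorem 4 to bound $c$ by $O(\log^2 n \log^3(1/\gamma)/\sqrt{n})$, and note that $\log(1 + c)$ is thus absorbed
by $\log n$ and $\log(1/\gamma)$. The theorem then follows. ■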
D. Step 1d

Recall that $z_i = A x_i + n_i$.

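Theorem 8: Let $t' \le t$. If there exist $\epsilon_1$, $\epsilon_2$, $c$ such that

$$(I) \quad \sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{t'} |w^\top x|^2_{(i)} - \mathcal{V}\Big(\frac{t'}{t}\Big) \Big| \le \epsilon_1,$$

$$(II) \quad \sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{t} |w^\top x_i|^2 - 1 \Big| \le \epsilon_2,$$

$$(III) \quad \sup_{w \in \mathcal{S}_m} \frac{1}{t} \sum_{i=1}^{t} |w^\top n_i|^2 \le c,$$

then for all $w \in \mathcal{S}_m$ the following holds:

$$(1 - \epsilon_1)\|w^\top A\|^2 \mathcal{V}\Big(\frac{t'}{t}\Big) - 2\|w^\top A\| \sqrt{(1 + \epsilon_2)c} \le \frac{1}{t} \sum_{i=1}^{t'} |w^\top z|^2_{(i)} \le (1 + \epsilon_1)\|w^\top A\|^2 \mathcal{V}\Big(\frac{t'}{t}\Big) + 2\|w^\top A\| \sqrt{(1 + \epsilon_2)c} + c.$$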



Proof: Fix an arbitrary w G iS^- Let and be permutations of [1, ■ ■ ■ , t] such that both 

Iw^z- I and Iw^Ax- I are non-decreasing. Then we have: 

t' t' 

I T |2 {«) I T A , T |2 



i=l 
t' 



i=l 



i=l i=l i=l ) 

L i=l i=l i=l ) 

1 *' 

A||2sup-5^|vTx|2) + 2 



< w 



j=l 



i=l 



<(1 + ei) II w^Af V(t7t) + 2|| w^AII v/(l + e2)c + c. 



Here, (a) and (6) follow from the definition of jj, and (c) follows from the definition of ji and the well 
known inequality < (Ei a?)(E» )• 
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Similarly, we have 

t' , t' 



j=l i=l 

L j = l j=l 2=1 ) 



L i=l i=l i=l 

i=l i=l 



>(1 - 6i)||w"^A|pV(t'A) - 2||w^A||v/(l +62)C, 

where (a) follows from the definition of ji. 

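Corollary 2: Let $t' \le t$. If there exist $\epsilon_1$, $\epsilon_2$, $c$ such that

$$(I) \quad \sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{t'} |w^\top x|^2_{(i)} - \mathcal{V}\Big(\frac{t'}{t}\Big) \Big| \le \epsilon_1,$$

$$(II) \quad \sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{t} |w^\top x_i|^2 - 1 \Big| \le \epsilon_2,$$

$$(III) \quad \sup_{w \in \mathcal{S}_m} \frac{1}{t} \sum_{i=1}^{t} |w^\top n_i|^2 \le c,$$

then for any $w_1, \dots, w_d \in \mathcal{S}_m$ the following holds:

$$(1 - \epsilon_1) \mathcal{V}\Big(\frac{t'}{t}\Big) H(w_1, \dots, w_d) - 2\sqrt{(1 + \epsilon_2) c d H(w_1, \dots, w_d)} \le \sum_{j=1}^{d} \frac{1}{t} \sum_{i=1}^{t'} |w_j^\top z|^2_{(i)}$$

$$\le (1 + \epsilon_1) \mathcal{V}\Big(\frac{t'}{t}\Big) H(w_1, \dots, w_d) + 2\sqrt{(1 + \epsilon_2) c d H(w_1, \dots, w_d)} + c,$$

where $H(w_1, \dots, w_d) \triangleq \sum_{j=1}^{d} \|w_j^\top A\|^2$.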
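Proof: From Theorem 8 we have that

$$\sum_{j=1}^{d} (1 - \epsilon_1)\|w_j^\top A\|^2 \mathcal{V}\Big(\frac{t'}{t}\Big) - 2 \sum_{j=1}^{d} \|w_j^\top A\| \sqrt{(1 + \epsilon_2)c} \le \sum_{j=1}^{d} \frac{1}{t} \sum_{i=1}^{t'} |w_j^\top z|^2_{(i)}.$$

Note that $\sum_{j=1}^{d} a_j \le \sqrt{d \sum_{j=1}^{d} a_j^2}$ holds for any $a_1, \dots, a_d$; hence we have

$$(1 - \epsilon_1) \mathcal{V}\Big(\frac{t'}{t}\Big) H(w_1, \dots, w_d) - 2\sqrt{(1 + \epsilon_2) c d H(w_1, \dots, w_d)} \le \sum_{j=1}^{d} (1 - \epsilon_1)\|w_j^\top A\|^2 \mathcal{V}\Big(\frac{t'}{t}\Big) - 2 \sum_{j=1}^{d} \|w_j^\top A\| \sqrt{(1 + \epsilon_2)c},$$

which proves the first inequality of the corollary. The second one follows similarly.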
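Letting $t' = t$ we immediately have the following corollary.
Corollary 3: If there exist $\epsilon$, $c$ such that

$$(I) \quad \sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{t} |w^\top x_i|^2 - 1 \Big| \le \epsilon,$$

$$(II) \quad \sup_{w \in \mathcal{S}_m} \frac{1}{t} \sum_{i=1}^{t} |w^\top n_i|^2 \le c,$$

then for any $w_1, \dots, w_d \in \mathcal{S}_m$ the following holds:

$$(1 - \epsilon) H(w_1, \dots, w_d) - 2\sqrt{(1 + \epsilon) c d H(w_1, \dots, w_d)} \le \sum_{j=1}^{d} \frac{1}{t} \sum_{i=1}^{t} |w_j^\top z_i|^2 \le (1 + \epsilon) H(w_1, \dots, w_d) + 2\sqrt{(1 + \epsilon) c d H(w_1, \dots, w_d)} + c.$$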

E. Step 2 

The next step shows that the algorithm finds a good solution in a small number of steps. Proving this 
involves showing that at any given step, either the algorithm finds a good solution, or the random removal 
eliminates one of the corrupted points with high probability (i.e., probability bounded away from zero). 
The intuition then, is that there cannot be too many steps without finding a good solution, since too many 
of the corrupted points will have been removed. This section makes this intuition precise. 

Let us fix a $\kappa > 0$. Let $\mathcal{Z}(s)$ and $\mathcal{O}(s)$ be the set of remaining authentic samples and the set of remaining
corrupted points after the $s$-th stage, respectively. Then with this notation, $\mathcal{Y}(s) = \mathcal{Z}(s) \cup \mathcal{O}(s)$. Observe
that $|\mathcal{Y}(s)| = n - s$. Let $r(s) = \mathcal{Y}(s - 1) \setminus \mathcal{Y}(s)$, i.e., the point removed at stage $s$. Let $w_1(s), \dots, w_d(s)$
be the $d$ PCs found in the $s$-th stage; these points are the output of standard PCA on $\mathcal{Y}(s - 1)$. These
points are a good solution if the variance of the points projected onto their span is mainly due to the
authentic samples rather than the corrupted points. We denote this "good output event at step $s$" by $\mathcal{E}(s)$,
defined as follows:

$$\mathcal{E}(s) = \Big\{ \sum_{j=1}^{d} \sum_{z_i \in \mathcal{Z}(s-1)} (w_j(s)^\top z_i)^2 \ge \frac{1}{\kappa} \sum_{j=1}^{d} \sum_{o_i \in \mathcal{O}(s-1)} (w_j(s)^\top o_i)^2 \Big\}.$$

We show in the next theorem that with high probability, $\mathcal{E}(s)$ is true for at least one "small" $s$, by
showing that at every $s$ where it is not true, the random removal procedure removes a corrupted point
with probability at least $\kappa/(1 + \kappa)$.

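Theorem 9: With probability at least $1 - \gamma$, event $\mathcal{E}(s)$ is true for some $1 \le s \le s_0$, where

$$s_0 \triangleq (1 + \epsilon) \frac{(1 + \kappa)\lambda n}{\kappa}; \qquad \epsilon = \frac{16(1 + \kappa) \log(1/\gamma)}{\kappa \lambda n} + 4\sqrt{\frac{(1 + \kappa) \log(1/\gamma)}{\kappa \lambda n}}.$$

Remark 5: When $\kappa$ and $\lambda$ are fixed, we have $s_0/n \to (1 + \kappa)\lambda/\kappa$. Therefore, $s_0 \le t$ for $(1 + \kappa)\lambda <
\kappa(1 - \lambda)$ and $n$ large.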
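As a worked example of Theorem 9 and Remark 5, the sketch below evaluates $\epsilon$ and $s_0$ from the expressions $\epsilon = 16a + 4\sqrt{a}$ with $a = (1+\kappa)\log(1/\gamma)/(\kappa\lambda n)$ and $s_0 = (1+\epsilon)(1+\kappa)\lambda n/\kappa$. The parameter values ($n = 10^6$, $\lambda = 0.1$, $\kappa = 1$, $\gamma = 0.01$) and the function name are illustrative assumptions only.

```python
import math

def s0_bound(n, lam, kappa, gamma):
    """Return (s0, eps) with eps = 16a + 4*sqrt(a), a = (1+kappa) log(1/gamma) / (kappa*lam*n)."""
    a = (1 + kappa) * math.log(1 / gamma) / (kappa * lam * n)
    eps = 16 * a + 4 * math.sqrt(a)
    s0 = (1 + eps) * (1 + kappa) * lam * n / kappa
    return s0, eps

n, lam, kappa, gamma = 10**6, 0.1, 1.0, 0.01
s0, eps = s0_bound(n, lam, kappa, gamma)
t = (1 - lam) * n                                  # number of authentic points
print(f"eps = {eps:.4f}, s0/n = {s0 / n:.4f}, limit = {(1 + kappa) * lam / kappa:.4f}, s0 <= t: {s0 <= t}")
```

For these values $s_0/n$ is already close to its limit $(1+\kappa)\lambda/\kappa = 0.2$, and $s_0$ is well below $t = 0.9n$, so a good event occurs before the authentic points could be exhausted.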

When $s_0 \ge n$, Theorem 9 holds trivially. Hence we focus on the case where $s_0 < n$. En route to
proving this theorem, we first prove that when $\mathcal{E}(s)$ is not true, our procedure removes a corrupted point
with high probability. To this end, let $\mathcal{F}_s$ be the filtration generated by the set of events until stage $s$.
Observe that $\mathcal{O}(s), \mathcal{Z}(s), \mathcal{Y}(s) \in \mathcal{F}_s$. Furthermore, since given $\mathcal{Y}(s)$, performing a PCA is deterministic,
$\mathcal{E}(s + 1) \in \mathcal{F}_s$.

Theorem 10: If $\mathcal{E}^c(s)$ is true, then

$$\Pr(\{r(s) \in \mathcal{O}(s - 1)\} \mid \mathcal{F}_{s-1}) > \frac{\kappa}{1 + \kappa}.$$
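Proof: If $\mathcal{E}^c(s)$ is true, then

$$\sum_{j=1}^{d} \sum_{z_i \in \mathcal{Z}(s-1)} (w_j(s)^\top z_i)^2 < \frac{1}{\kappa} \sum_{j=1}^{d} \sum_{o_i \in \mathcal{O}(s-1)} (w_j(s)^\top o_i)^2,$$

which is equivalent to

$$\frac{\kappa}{1 + \kappa} \Big[ \sum_{z_i \in \mathcal{Z}(s-1)} \sum_{j=1}^{d} (w_j(s)^\top z_i)^2 + \sum_{o_i \in \mathcal{O}(s-1)} \sum_{j=1}^{d} (w_j(s)^\top o_i)^2 \Big] < \sum_{o_i \in \mathcal{O}(s-1)} \sum_{j=1}^{d} (w_j(s)^\top o_i)^2.$$

Note that

$$\Pr(\{r(s) \in \mathcal{O}(s - 1)\} \mid \mathcal{F}_{s-1}) = \sum_{o_i \in \mathcal{O}(s-1)} \Pr(r(s) = o_i \mid \mathcal{F}_{s-1}) = \frac{\sum_{o_i \in \mathcal{O}(s-1)} \sum_{j=1}^{d} (w_j(s)^\top o_i)^2}{\sum_{z_i \in \mathcal{Z}(s-1)} \sum_{j=1}^{d} (w_j(s)^\top z_i)^2 + \sum_{o_i \in \mathcal{O}(s-1)} \sum_{j=1}^{d} (w_j(s)^\top o_i)^2} > \frac{\kappa}{1 + \kappa}.$$

Here, the second equality follows from the definition of the algorithm, and in particular, that in stage $s$,
we remove a point $y$ with probability proportional to $\sum_{j=1}^{d} (w_j(s)^\top y)^2$, independent of other events.

■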
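The removal rule behind Theorem 10 can be illustrated numerically. The sketch below builds a small synthetic set in which a handful of large outliers dominate the leading principal component, so that the good event fails, and then evaluates the probability that the removed point is corrupted. The data sizes, the outlier construction, the value $\kappa = 2$ and the helper name are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def corrupted_removal_prob(W, Z, O):
    """Probability that the removed point is corrupted, for PCs W (m, d),
    authentic points Z (m, n_auth) and corrupted points O (m, n_out)."""
    auth = np.sum((W.T @ Z) ** 2)      # variance mass of authentic points along the PCs
    out = np.sum((W.T @ O) ** 2)       # variance mass of corrupted points along the PCs
    return out / (auth + out), auth, out

rng = np.random.default_rng(2)
m, kappa = 20, 2.0
Z = rng.standard_normal((m, 50))                   # authentic points
u = rng.standard_normal((m, 1))
u /= np.linalg.norm(u)
O = 10.0 * u @ np.ones((1, 10))                    # ten large outliers on one line
Y = np.hstack([Z, O])
W = np.linalg.svd(Y, full_matrices=False)[0][:, :1]   # leading PC of the uncentered data
prob, auth, out = corrupted_removal_prob(W, Z, O)
good_event = auth >= out / kappa
print(f"good event: {good_event}, Pr(remove corrupted) = {prob:.3f}, threshold = {kappa / (1 + kappa):.3f}")
```

Whenever the good event fails, the ratio `out / (auth + out)` exceeds $\kappa/(1+\kappa)$ by simple algebra, which is exactly the equivalence used in the proof.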

As a consequence of this theorem, we can now prove Theorem 9. The intuition is rather straightforward:
if the events were independent from one step to the next, then since the "expected corrupted points removed"
is at least $\kappa/(1 + \kappa)$, after $s_0 = (1 + \epsilon)(1 + \kappa)\lambda n/\kappa$ steps, with exponentially high probability all
the outliers would be removed, and hence we would have a good event with high probability, for some
$s \le s_0$. Since subsequent steps are not independent, we have to rely on martingale arguments.

Let $T = \min\{s \mid \mathcal{E}(s) \text{ is true}\}$. Note that since $\mathcal{E}(s) \in \mathcal{F}_{s-1}$, we have $\{T > s\} \in \mathcal{F}_{s-1}$. Define the
following random variable

$$X_s = \begin{cases} |\mathcal{O}(T - 1)| + \dfrac{\kappa (T - 1)}{1 + \kappa}, & \text{if } T \le s; \\[2mm] |\mathcal{O}(s)| + \dfrac{\kappa s}{1 + \kappa}, & \text{if } T > s. \end{cases}$$

Lemma 4: $\{X_s, \mathcal{F}_s\}$ is a supermartingale.

Proof: The proof essentially follows from the definition of $X_s$, and the fact that if $\mathcal{E}^c(s)$ is true, then
$|\mathcal{O}(s)|$ decreases by one with probability at least $\kappa/(1 + \kappa)$. The full details are deferred to the appendix.

■

From here, the proof of Theorem 9 follows fairly quickly.
Proof: Note that

$$\Pr\Big( \bigcap_{s=1}^{s_0} \mathcal{E}^c(s) \Big) = \Pr(T > s_0) \le \Pr\Big( X_{s_0} \ge \frac{\kappa s_0}{1 + \kappa} \Big) = \Pr\big( X_{s_0} \ge (1 + \epsilon)\lambda n \big), \quad (3)$$

where the inequality is due to $|\mathcal{O}(s)|$ being non-negative. Recall that $X_0 = \lambda n$. Thus the probability
that no good events occur before step $s_0$ is at most the probability that a supermartingale with bounded
increments increases in value by a constant factor of $(1 + \epsilon)$, from $\lambda n$ to $(1 + \epsilon)\lambda n$. An appeal to
Azuma's inequality shows that this is exponentially unlikely. The details are left to the appendix.






F. Step 3 

Let $\bar{w}_1, \dots, \bar{w}_d$ be the eigenvectors corresponding to the $d$ largest eigenvalues of $AA^\top$, i.e., the optimal
solution. Let $w_1^*, \dots, w_d^*$ be the output of the algorithm. Let $w_1(s), \dots, w_d(s)$ be the candidate solution
at stage $s$. Recall that $H(w_1, \dots, w_d) \triangleq \sum_{j=1}^{d} \|w_j^\top A\|^2$, and for notational simplification, let $\bar{H} \triangleq
H(\bar{w}_1, \dots, \bar{w}_d)$, $H_s \triangleq H(w_1(s), \dots, w_d(s))$, and $H^* \triangleq H(w_1^*, \dots, w_d^*)$.

The statements of the finite-sample and asymptotic theorems (Theorems 1 and 2, respectively) lower
bound the expressed variance, E.V., which is the ratio $H^*/\bar{H}$. The final part of the proof accomplishes this
in two main steps. First, Lemma 5 lower bounds $H_s$ in terms of $\bar{H}$, where $s$ is some step for which $\mathcal{E}(s)$
is true, i.e., the principal components found by the $s$-th step of the algorithm are "good." By Theorem 9,
we know that there is a "small" such $s$, with high probability. The final output of the algorithm, however,
is only guaranteed to have a high value of the robust variance estimator, $\overline{V}$; that is, even if there is a
"good" solution at some intermediate step $s$, we do not necessarily have a way of identifying it. Thus,
the next step, Lemma 6, lower bounds the value of $H^*$ in terms of the value $H$ of any output $w_1', \dots, w_d'$
that has a smaller value of the robust variance estimator.

We give the statement of all the intermediate results, leaving the details of the proof to the appendix. 

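Lemma 5: If $\mathcal{E}(s)$ is true for some $s \le s_0$, and there exist $\epsilon_1$, $\epsilon_2$, $c$ such that

$$(I) \quad \sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{t - s_0} |w^\top x|^2_{(i)} - \mathcal{V}\Big(\frac{t - s_0}{t}\Big) \Big| \le \epsilon_1,$$

$$(II) \quad \sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{t} |w^\top x_i|^2 - 1 \Big| \le \epsilon_2,$$

$$(III) \quad \sup_{w \in \mathcal{S}_m} \frac{1}{t} \sum_{i=1}^{t} |w^\top n_i|^2 \le c,$$

then

$$\frac{1}{1 + \kappa} \Big[ (1 - \epsilon_1) \mathcal{V}\Big(\frac{t - s_0}{t}\Big) \bar{H} - 2\sqrt{(1 + \epsilon_2) c d \bar{H}} \Big] \le (1 + \epsilon_2) H_s + 2\sqrt{(1 + \epsilon_2) c d H_s} + c.$$

Lemma 6: Fix a $\hat{t} \le t$. If $\sum_{j=1}^{d} \overline{V}_{\hat{t}}(w_j) \ge \sum_{j=1}^{d} \overline{V}_{\hat{t}}(w_j')$, and there exist $\epsilon_1$, $\epsilon_2$, $c$ such that

$$(I) \quad \sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{\hat{t}} |w^\top x|^2_{(i)} - \mathcal{V}\Big(\frac{\hat{t}}{t}\Big) \Big| \le \epsilon_1,$$

$$(II) \quad \sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{\hat{t} - \frac{\lambda t}{1 - \lambda}} |w^\top x|^2_{(i)} - \mathcal{V}\Big(\frac{\hat{t}}{t} - \frac{\lambda}{1 - \lambda}\Big) \Big| \le \epsilon_1,$$

$$(III) \quad \sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{t} |w^\top x_i|^2 - 1 \Big| \le \epsilon_2,$$

$$(IV) \quad \sup_{w \in \mathcal{S}_m} \frac{1}{t} \sum_{i=1}^{t} |w^\top n_i|^2 \le c,$$

then

$$(1 - \epsilon_1) \mathcal{V}\Big(\frac{\hat{t}}{t} - \frac{\lambda}{1 - \lambda}\Big) H(w_1', \dots, w_d') - 2\sqrt{(1 + \epsilon_2) c d H(w_1', \dots, w_d')}$$

$$\le (1 + \epsilon_1) H(w_1, \dots, w_d) \mathcal{V}\Big(\frac{\hat{t}}{t}\Big) + 2\sqrt{(1 + \epsilon_2) c d H(w_1, \dots, w_d)} + c.$$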
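Theorem 11: If $\bigcup_{s=1}^{s_0} \mathcal{E}(s)$ is true, and there exist $\epsilon_1 < 1$, $\epsilon_2$, $c$ such that

$$(I) \quad \sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{t - s_0} |w^\top x|^2_{(i)} - \mathcal{V}\Big(\frac{t - s_0}{t}\Big) \Big| \le \epsilon_1,$$

$$(II) \quad \sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{\hat{t}} |w^\top x|^2_{(i)} - \mathcal{V}\Big(\frac{\hat{t}}{t}\Big) \Big| \le \epsilon_1,$$

$$(III) \quad \sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{\hat{t} - \frac{\lambda t}{1 - \lambda}} |w^\top x|^2_{(i)} - \mathcal{V}\Big(\frac{\hat{t}}{t} - \frac{\lambda}{1 - \lambda}\Big) \Big| \le \epsilon_1,$$

$$(IV) \quad \sup_{w \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{t} |w^\top x_i|^2 - 1 \Big| \le \epsilon_2,$$

$$(V) \quad \sup_{w \in \mathcal{S}_m} \frac{1}{t} \sum_{i=1}^{t} |w^\top n_i|^2 \le c,$$

then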



if 



;i + ei)(l + e2)(l + fi:)V 



(2«: + 4)(1 - ei)V (^f - y^j /(TT^ + 4(1 + + 62) ^(1 + ^2)0^ 

(l + ei)(l + e2)(l + K)vf| 



-1/2 



(4) 



;i+e2)c 



{H)-\ 

;i+6i)(l + 62)V(^f^^ 

By bounding all diminishing terms in the r.h.s. of (4), we can reformulate the above theorem in a
slightly more palatable form, as stated in Theorem 1:

Theorem 1: Let $\tau = \max(m/n, 1)$. There exist a universal constant $c_0$ and a constant $C$ which can
possibly depend on $\hat{t}/t$, $\lambda$, $d$, $\mu$ and $\kappa$, such that for any $\gamma < 1$, if $n/\log^4 n \ge \log^6(1/\gamma)$, then with
probability $1 - \gamma$ the following holds:



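$$\frac{H^*}{\bar{H}} \ge \frac{\mathcal{V}\Big(\frac{\hat{t}}{t} - \frac{\lambda}{1 - \lambda}\Big) \mathcal{V}\Big(1 - \frac{\lambda(1 + \kappa)}{(1 - \lambda)\kappa}\Big)}{(1 + \kappa) \mathcal{V}\Big(\frac{\hat{t}}{t}\Big)} - \Big[ \frac{8\sqrt{c_0 \tau d}}{\mathcal{V}\Big(\frac{\hat{t}}{t}\Big)} \Big] (\bar{H})^{-1/2} - \Big[ \frac{2 c_0 \tau}{\mathcal{V}\Big(\frac{\hat{t}}{t}\Big)} \Big] (\bar{H})^{-1} - C \frac{\log^2 n \log^3(1/\gamma)}{\sqrt{n}}.$$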



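We immediately get the asymptotic bound of Theorem 2 as a corollary:
Theorem 2: The asymptotic performance of HR-PCA is given by

$$\frac{H^*}{\bar{H}} \ge \max_{\kappa} \frac{\mathcal{V}\Big(1 - \frac{\lambda(1 + \kappa)}{(1 - \lambda)\kappa}\Big) \mathcal{V}\Big(\frac{\hat{t}}{t} - \frac{\lambda}{1 - \lambda}\Big)}{(1 + \kappa) \mathcal{V}\Big(\frac{\hat{t}}{t}\Big)}.$$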
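To get a feel for the asymptotic bound, the sketch below evaluates $\max_\kappa \mathcal{V}\big(1 - \frac{\lambda(1+\kappa)}{(1-\lambda)\kappa}\big)\,\mathcal{V}\big(\frac{\hat t}{t} - \frac{\lambda}{1-\lambda}\big) / \big((1+\kappa)\mathcal{V}(\frac{\hat t}{t})\big)$ on a grid of $\kappa$ values. Two things here are assumptions rather than statements from this section: $\mathcal{V}(\alpha)$ is taken to be the second moment of a standard Gaussian marginal restricted to the central $\alpha$ fraction of its mass, which gives $\mathcal{V}(\alpha) = \alpha - 2q\phi(q)$ with $q = \Phi^{-1}((1+\alpha)/2)$, and the maximization is approximated by a finite grid. All names are hypothetical.

```python
from statistics import NormalDist

_STD = NormalDist()

def trimmed_variance(alpha):
    """Assumed V(alpha): E[x^2; |x| <= q] for x ~ N(0, 1), with q the (1+alpha)/2 quantile."""
    if alpha <= 0.0:
        return 0.0
    if alpha >= 1.0:
        return 1.0
    q = _STD.inv_cdf((1.0 + alpha) / 2.0)
    return alpha - 2.0 * q * _STD.pdf(q)

def asymptotic_bound(lam, t_hat_ratio, kappas):
    """Grid approximation of the max over kappa in the asymptotic bound."""
    best = 0.0
    for kappa in kappas:
        arg = 1.0 - lam * (1.0 + kappa) / ((1.0 - lam) * kappa)
        val = (trimmed_variance(arg)
               * trimmed_variance(t_hat_ratio - lam / (1.0 - lam))
               / ((1.0 + kappa) * trimmed_variance(t_hat_ratio)))
        best = max(best, val)
    return best

kappas = [0.05 * k for k in range(1, 200)]
for lam in (0.05, 0.2):
    print(f"lambda = {lam}: bound = {asymptotic_bound(lam, 1.0, kappas):.3f}")
```

Under these assumptions the bound decreases as the fraction of corrupted points $\lambda$ grows, and stays strictly positive as long as some $\kappa$ makes both arguments of $\mathcal{V}$ positive.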



VI. Numerical Illustrations

We report in this section some numerical results of the proposed algorithm on synthetic data. We
compare its performance with standard PCA, and several robust PCA algorithms, namely Multi-Variate
iterative Trimming (MVT), ROBPCA proposed in [27], and the (approximate) Projection-Pursuit (PP)
algorithm proposed in [30]. One objective of this numerical study is to illustrate how the special properties
of the high-dimensional regime discussed in Section II can degrade the performance of available robust
PCA algorithms, and make some of them completely invalid.

We report the $d = 1$ case first. We randomly generate an $m \times 1$ matrix and scale it so that its leading
eigenvalue has magnitude equal to a given $\sigma$. A $\lambda$ fraction of outliers are generated on a line with a
uniform distribution over $[-\sigma \cdot mag, \sigma \cdot mag]$. Thus, $mag$ represents the ratio between the magnitude of
the outliers and that of the signal $A x_i$. For each parameter setup, we report the average result of 20 tests.
The MVT algorithm breaks down in the $n = m$ case since it involves taking the inverse of the covariance
matrix, which is ill-conditioned. Hence we do not report MVT results in any of the experiments with
$n = m$, as shown in Figure 2, and perform a separate test for MVT, HR-PCA and PCA under the case
that $m \ll n$, reported in Figure 4.
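The $d = 1$ setup described above can be mimicked with a short data generator. This is a sketch under explicit assumptions, not the authors' experimental code: the function name is hypothetical, the signal and noise are taken to be standard Gaussian, "leading eigenvalue equal to $\sigma$" is interpreted as scaling the $m \times 1$ matrix $A$ to Euclidean norm $\sigma$, and the outlier line is a uniformly random direction.

```python
import numpy as np

def make_contaminated_data(n, m, lam, sigma, mag, seed=None):
    """Generate n points in R^m: authentic z = A x + noise, plus a lam fraction of outliers on a line."""
    rng = np.random.default_rng(seed)
    n_out = int(round(lam * n))
    t = n - n_out                                   # number of authentic points
    A = rng.standard_normal((m, 1))
    A *= sigma / np.linalg.norm(A)                  # assumed scaling: singular value of A equals sigma
    x = rng.standard_normal((1, t))                 # one-dimensional signal
    Z = A @ x + rng.standard_normal((m, t))         # authentic points with unit-variance noise
    line = rng.standard_normal((m, 1))
    line /= np.linalg.norm(line)                    # random direction carrying the outliers
    coeff = rng.uniform(-sigma * mag, sigma * mag, size=(1, n_out))
    O = line @ coeff                                # outliers, uniform along the line
    return np.hstack([Z, O]), A, np.arange(t, n)

Y, A, outlier_idx = make_contaminated_data(n=100, m=100, lam=0.2, sigma=5.0, mag=10.0, seed=0)
print(Y.shape, outlier_idx.size)
```

The returned index array marks which columns are outliers, which makes it straightforward to score any subspace estimate against the direction of `A`.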

We make the following three observations from Figure 2. First, PP and ROBPCA can break down when
$\lambda$ is large, while on the other hand, the performance of HR-PCA is rather robust even when $\lambda$ is as large as
40%. Second, the performance of PP and ROBPCA depends strongly on $\sigma$, i.e., the signal magnitude (and
hence the magnitude of the corrupted points). Indeed, when $\sigma$ is very large, ROBPCA achieves effectively
optimal recovery of the $A$ subspace. However, the performance of both algorithms is not satisfactory
when $\sigma$ is small, and sometimes even worse than the performance of standard PCA. Finally, and perhaps
most importantly, the performance of PP and ROBPCA degrades as the dimensionality increases, which
makes them essentially unsuitable for the high-dimensional regime we consider here. This is more
explicitly shown in Figure 3, where the performance of different algorithms versus dimensionality is
reported. We notice that the performance of ROBPCA (and similarly other algorithms based on
Stahel-Donoho outlyingness) has a sharp decrease at a certain threshold that corresponds to the dimensionality
where S-D outlyingness becomes invalid in identifying outliers.

Figure 4 shows that the performance of MVT depends on the dimensionality $m$. Indeed, the breakdown
property of MVT is roughly $1/m$ as predicted by the theoretical analysis, which makes MVT less attractive
in the high-dimensional regime.

A similar numerical study for $d = 3$ is also performed, where the outliers are generated on 3 randomly
chosen lines. The results are reported in Figure 5. The same trends as in the $d = 1$ case are observed,
although the performance gap between different strategies is smaller, because the effect of the outliers is
reduced when they are spread over 3 directions.

VII. Concluding Remarks 

In this paper, we investigated the dimensionality-reduction problem in the case where the number and
the dimensionality of samples are of the same magnitude, and a constant fraction of the points are
arbitrarily corrupted (perhaps maliciously so). We proposed a High-dimensional Robust Principal Component
Analysis algorithm that is tractable, robust to corrupted points, easily kernelizable and asymptotically
optimal. The algorithm iteratively finds a set of PCs using standard PCA and subsequently removes a
point randomly with a probability proportional to its expressed variance. We provided both theoretical
guarantees and favorable simulation results about the performance of the proposed algorithm.

To the best of our knowledge, previous efforts to extend existing robust PCA algorithms into the
high-dimensional case remain unsuccessful. Such algorithms are designed for low dimensional data sets
where the observations significantly outnumber the variables of each observation. When applied to
high-dimensional data sets, they either lose statistical consistency due to lack of sufficient observations, or
become highly intractable. This motivates our work of proposing a new robust PCA algorithm that takes
into account the inherent difficulty in analyzing high-dimensional data.
Appendix

In this appendix, we provide some of the details omitted in Section V.
A. Proof of Lemma 2

Lemma 2: Given $\delta \in [0, 1]$, $c \in \mathbb{R}^+$, and $\hat{m}, m \in \mathbb{N}$ satisfying $\hat{m} < m$, let $a_1, \dots, a_m$ be i.i.d. samples
drawn from a probability measure $\mu$ that is supported on $\mathbb{R}^+$ and has a density function. Assume that $E(a) = 1$
and $\frac{1}{m} \sum_{i=1}^{m} a_i \le 1 + c$. Then with probability at least $1 - \delta$ we have

$$\sup_{\bar{m} \le \hat{m}} \Big| \frac{1}{m} \sum_{i=1}^{\bar{m}} a_{(i)} - \int_0^{\mu^{-1}(\bar{m}/m)} a \, d\mu \Big| \le \frac{(2 + c)m}{m - \hat{m}} \sqrt{\frac{8(2 \log m + 1 + \log\frac{8}{\delta})}{m}},$$

where $\mu^{-1}(x) = \min\{z \mid \mu(a \le z) \ge x\}$.

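Proof: In Section V, using a VC dimension argument, we showed that

$$\Pr\Big( \sup_{e \ge 0} \Big| \frac{1}{m} \sum_{i=1}^{m} g_e(a_i) - E g_e(a) \Big| \ge \epsilon_0 \Big) \le 4 \exp\big(2 \log m + 1 - m\epsilon_0^2/8\big) = \frac{\delta}{2}; \quad (5)$$

and

$$\Pr\Big( \sup_{e \in [0, (1+c)m/(m-\hat{m})]} \Big| \frac{1}{m} \sum_{i=1}^{m} f_e(a_i) - E f_e(a) \Big| \ge \epsilon \Big) \le 4 \exp\Big(2 \log m + 1 - \frac{\epsilon^2 (m - \hat{m})^2}{8(1 + c)^2 m}\Big) = \frac{\delta}{2}. \quad (6)$$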



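To complete the proof, define $h(\cdot) : [0, 1] \to \mathbb{R}^+$ as $h(x) = \int_0^{\mu^{-1}(x)} a \, d\mu$. Since $\mu$ is supported on $\mathbb{R}^+$, by
Markov inequality we have for any $\bar{n} \le \hat{m}$,

$$a_{(\bar{n})} \le a_{(\bar{n}+1)} \le \frac{(1 + c)m}{m - \bar{n}},$$

due to $E(a) = 1$ and $\frac{1}{m} \sum_{i=1}^{m} a_i \le 1 + c$. Similarly, by Markov inequality we have for any $d, \epsilon \in [0, 1]$
such that $d + \epsilon \le 1$, the following holds:

$$\frac{h(d + \epsilon) - h(d)}{\epsilon} \le \frac{h(1) - h(d)}{1 - d},$$

which implies

$$h(d + \epsilon) - h(d) \le \frac{\epsilon}{1 - d}. \quad (7)$$

Let $e_{\bar{n}} = a_{(\bar{n})}$ for $\bar{n} \le \hat{m}$; we have

$$\sup_{\bar{m} \le \hat{m}} \Big| \frac{1}{m} \sum_{i=1}^{\bar{m}} a_{(i)} - h(\bar{m}/m) \Big| = \sup_{\bar{m} \le \hat{m}} \Big| \frac{1}{m} \sum_{i=1}^{m} f_{e_{\bar{m}}}(a_i) - h(\bar{m}/m) \Big|$$

$$\le \sup_{\bar{m} \le \hat{m}} \Big| \frac{1}{m} \sum_{i=1}^{m} f_{e_{\bar{m}}}(a_i) - E f_{e_{\bar{m}}}(a) \Big| + \sup_{\bar{m} \le \hat{m}} \big| E f_{e_{\bar{m}}}(a) - h(\bar{m}/m) \big|.$$

With probability at least $1 - \delta/2$, the first term is upper bounded by $\epsilon$ due to Inequality (6). To bound
the second term, we note that from Inequality (5), with probability at least $1 - \delta/2$ the following holds:

$$\sup_{\bar{m} \le \hat{m}} \big| \bar{m}/m - \mu([0, e_{\bar{m}}]) \big| = \sup_{\bar{m} \le \hat{m}} \Big| \frac{1}{m} \sum_{i=1}^{m} g_{e_{\bar{m}}}(a_i) - E g_{e_{\bar{m}}}(a) \Big| \le \epsilon_0.$$

This is equivalent to saying that with probability $1 - \delta/2$, $\mu^{-1}(\bar{m}/m - \epsilon_0) \le e_{\bar{m}} \le \mu^{-1}(\bar{m}/m + \epsilon_0)$ holds uniformly
for all $\bar{m} \le \hat{m}$. Note that this further implies

$$\sup_{\bar{m} \le \hat{m}} \big| E f_{e_{\bar{m}}}(a) - h(\bar{m}/m) \big| \le \sup_{\bar{m} \le \hat{m}} \max\big[ h(\bar{m}/m + \epsilon_0) - h(\bar{m}/m),\ h(\bar{m}/m) - h(\bar{m}/m - \epsilon_0) \big] \le \sup_{\bar{m} \le \hat{m}} \frac{m \epsilon_0}{m - \bar{m}} = \frac{m \epsilon_0}{m - \hat{m}},$$

where the last inequality follows from (7). This bounds the second term. Summing up the two terms
proves the lemma. ■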
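B. Proof of Theorem 7

Theorem 7: If $\sup_{w \in \mathcal{S}_d} \big| \frac{1}{t} \sum_{i=1}^{t} (w^\top x_i)^2 - 1 \big| \le c$, then

$$\Pr\Big\{ \sup_{w \in \mathcal{S}_d,\ \bar{t} \le \hat{t}} \Big| \frac{1}{t} \sum_{i=1}^{\bar{t}} |w^\top x|^2_{(i)} - \mathcal{V}\Big(\frac{\bar{t}}{t}\Big) \Big| \ge \epsilon \Big\} \le \max\Big[ \frac{8 e t^2 6^d (1 + c)^{d/2}}{\epsilon^{d/2}},\ \frac{8 e t^2 24^d (1 + c)^d t^{d/2}}{\epsilon^d (t - \hat{t})^{d/2}} \Big] \exp\Big( -\frac{\epsilon^2 (1 - \hat{t}/t)^2 t}{32 (2 + c)^2} \Big).$$

Proof: In Section V, we cover $\mathcal{S}_d$ with a finite $\epsilon$-net $\hat{\mathcal{S}}_d$, and prove a uniform bound on this finite set,
showing

$$\Pr\Big\{ \sup_{w \in \hat{\mathcal{S}}_d,\ \bar{t} \le \hat{t}} \Big| \frac{1}{t} \sum_{i=1}^{\bar{t}} |w^\top x|^2_{(i)} - \mathcal{V}\Big(\frac{\bar{t}}{t}\Big) \Big| > \epsilon/2 \Big\} \le \frac{8 e t^2 3^d}{\delta^d} \exp\Big( -\frac{(1 - \hat{t}/t)^2 \epsilon^2 t}{32 (2 + c)^2} \Big).$$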

We have left to relate the uniform bound on with the uniform bound on this finite set. 
For any w, wi G Sd such that || vi^ — wi || < 5 and t < t, we have 



|2 I 



1=1 i=l 

iX:[(w^x.f-(w7x.n|, |^X:[(w^x. 



(8) 



< max 



i=l 



1=1 



where (xi, ■ ■ ■ , Xt) and (xi, • ■ ■ ,Xt) are permutations of (xi, ■ ■ ■ ,Xt) such that |w^Xj| and jw^Xjl are 
non-decreasing with i. 

To bound the right hand side of ([8]), we note that 



1 * 



i=l 



^ J][(w^x,)^-((wi-w + w)'rx^2^ 



i=l 



5^[(wi - ^Y±,f + {[(wi - w)Tx.][w^x,]} 



i=l 



1=1 



(9) 



< max 

veSa t 



t t 

S^^Y v^x,x7v + 26 max(- V I v'^ 



W X 



i=l 



i=l 



Here the inequality holds because ||w — wi|| < 5, and |w^Xj| is non-decreasing with i. 
Note that for all v, v' G Sd, we have 



1 * 1 * 



j=l 



V XiX,- V < 1 + c: 



(//) i^|v-x.|<itl-"*.l£ 

i=l j=l ^ " i=l 

t t 

(III) Iw^^il' < Iw^Xil' < t{l + c); ^ Iw^Xfl < 



\ ^=l 



i=t+l 



i=l 



t-i 
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Here, (a) holds because |w^Xj| is non-decreasing with i. Substituting it back to the right hand side of ^ 
we have 



i=l 



< (1 + c)6' + 2(1 + c)6J . < e/2. 



t 



t - 1 



Similarly we have 



iX:[(wW-(w7x.r] = lX][((w, + w-w,rx.r-(w7x.)^] 

1=1 i=l 
t t 

5^[(w - wi)^x]2 - {[(w^ - w0^x,][w7x,]} 

i=l i=l 

1 * _ _ 1 * ^ 

<max5^- > v^XjxJv + 25 max(- > |v' 



Wi X 



1 ^th 



where the last inequality follows from that jw^^fl is non-decreasing with i. Note that the non-decreasing 
property also leads to 



|Wi X£| < 



t-i '' 



which implies that 



and consequently 



Thus, 



|^J][(w^x,f-(w7x,f]|<6/2, 
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Set -rrexp 
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exp 
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The first inequality holds because there exists E <Sd such that ||vi^ — vi^i|| < S, which implies 

|}ELi|w^-l^)-iELi|wi^-l?.)|<^/2. 
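C. Proof of Corollary 1 and Lemma 3

Corollary 1: If $\sup_{w \in \mathcal{S}_d} \big| \frac{1}{t} \sum_{i=1}^{t} (w^\top x_i)^2 - 1 \big| \le c$, then with probability $1 - \gamma$

$$\sup_{w \in \mathcal{S}_d,\ \bar{t} \le \hat{t}} \Big| \frac{1}{t} \sum_{i=1}^{\bar{t}} |w^\top x|^2_{(i)} - \mathcal{V}\Big(\frac{\bar{t}}{t}\Big) \Big| \le \epsilon_0,$$

where

$$\epsilon_0 = \sqrt{\frac{32(2 + c)^2 \max\big[\frac{d+4}{2} \log t + \log\frac{1}{\gamma} + \log(16 e 6^d) + \frac{d}{2} \log(1 + c),\ (1 - \hat{t}/t)^2\big]}{t (1 - \hat{t}/t)^2}}$$

$$\quad + \sqrt{\frac{32(2 + c)^2 \max\big[(d + 2) \log t + \log\frac{1}{\gamma} + \log(16 e 24^d) + d \log(1 + c) - \frac{d}{2} \log(1 - \hat{t}/t),\ (1 - \hat{t}/t)^2\big]}{t (1 - \hat{t}/t)^2}}.$$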

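Proof: The proof of the corollary requires Lemma 3.
Lemma 3: For any $C_1, C_2, d', t \ge 0$, and $0 < \gamma < 1$, let

$$\epsilon = \sqrt{\frac{\max(d' \log t - \log(\gamma/C_1),\ C_2)}{t C_2}},$$

then

$$C_1 \epsilon^{-d'} \exp(-C_2 \epsilon^2 t) \le \gamma.$$

Proof: Note that

$$-C_2 \epsilon^2 t - d' \log \epsilon = -\max(d' \log t - \log(\gamma/C_1),\ C_2) - \frac{d'}{2} \log\big[\max\big((d' \log t - \log(\gamma/C_1))/C_2,\ 1\big)\big] + \frac{d'}{2} \log t.$$

It is easy to see that, for $t \ge 1$, the r.h.s. is upper-bounded by $\log(\gamma/C_1)$ if $d' \log t - \log(\gamma/C_1) \ge C_2$. If $d' \log t -
\log(\gamma/C_1) < C_2$, then the r.h.s. equals $-C_2 + \frac{d'}{2} \log t$, which is again upper-bounded by $\log(\gamma/C_1)$ due to
$d' \log t - \log(\gamma/C_1) < C_2$. Thus, we have

$$-C_2 \epsilon^2 t - d' \log \epsilon \le \log(\gamma/C_1),$$

which is equivalent to

$$C_1 \epsilon^{-d'} \exp(-C_2 \epsilon^2 t) \le \gamma.$$

■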

Now to prove the corollary: let 



£2 



32(2 + c)2{ max[^^ logt + logi + log(16e6'^) + f log(l + c), (1 - i/ty]} 



'32(2 + c)2{max[(rf + 2)logt + logi + log(16e224'^) + c/log(l + c) - f log(l - t/^), {1-i/ty]} 



t{i-t/ty 

By Lemma [3l we have 
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By Theorem |7] we have 
Pr<j sup 

l,t<i 



^ |w^x|^,) -Vr-]\>eo 

we5d,t<t i=i ^ 



< max 



d/2 



< 



exp 



d/2 



exp 



32(2 + c)2 
32(2 + c)2 



< ' exp -^^—-LL^ + — exp 
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32(2 + c) 



\{t - tyi^ 



<7- 



The third inequality holds because ei, e2 < cq. 



6l(i-tA)^t 

32(2 + c)2 



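D. Proof of Theorem 9 and Lemma 4
Recall the statement of Theorem 9:

Theorem 9: With probability at least $1 - \gamma$, $\bigcup_{s=1}^{s_0} \mathcal{E}(s)$ is true. Here

$$s_0 \triangleq (1 + \epsilon) \frac{(1 + \kappa)\lambda n}{\kappa}; \qquad \epsilon = \frac{16(1 + \kappa) \log(1/\gamma)}{\kappa \lambda n} + 4\sqrt{\frac{(1 + \kappa) \log(1/\gamma)}{\kappa \lambda n}}.$$

Recall that we defined the random variable $X_s$ as follows: Let $T = \min\{s \mid \mathcal{E}(s) \text{ is true}\}$. Note that since
$\mathcal{E}(s) \in \mathcal{F}_{s-1}$, we have $\{T > s\} \in \mathcal{F}_{s-1}$. Then define:

$$X_s = \begin{cases} |\mathcal{O}(T - 1)| + \dfrac{\kappa (T - 1)}{1 + \kappa}, & \text{if } T \le s; \\[2mm] |\mathcal{O}(s)| + \dfrac{\kappa s}{1 + \kappa}, & \text{if } T > s. \end{cases}$$

The proof of the above theorem depends on first showing that the random variable, $X_s$, is a supermartingale.
Lemma 4: $\{X_s, \mathcal{F}_s\}$ is a supermartingale.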

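Proof: Observe that $X_s \in \mathcal{F}_s$. We next show that $E(X_s \mid \mathcal{F}_{s-1}) \le X_{s-1}$ by enumerating the following
three cases:

Case 1, $T > s$: Thus we have that $\mathcal{E}^c(s)$ is true. By Theorem 10,

$$E(X_s - X_{s-1} \mid \mathcal{F}_{s-1}) = E\Big( |\mathcal{O}(s)| - |\mathcal{O}(s - 1)| + \frac{\kappa}{1 + \kappa} \,\Big|\, \mathcal{F}_{s-1} \Big) = \frac{\kappa}{1 + \kappa} - \Pr\big(r(s) \in \mathcal{O}(s - 1) \mid \mathcal{F}_{s-1}\big) < 0.$$

Case 2, $T = s$: By definition of $X_s$ we have $X_s = |\mathcal{O}(s - 1)| + \kappa(s - 1)/(1 + \kappa) = X_{s-1}$.
Case 3, $T < s$: Since both $T$ and $s$ are integers, we have $T \le s - 1$. Thus, $X_{s-1} = |\mathcal{O}(T - 1)| + \kappa(T -
1)/(1 + \kappa) = X_s$.

Combining all three cases shows that $E(X_s \mid \mathcal{F}_{s-1}) \le X_{s-1}$, which proves the lemma.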
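Next, we prove Theorem 9.
Proof: Note that

$$\Pr\Big( \bigcap_{s=1}^{s_0} \mathcal{E}^c(s) \Big) = \Pr(T > s_0) \le \Pr\Big( X_{s_0} \ge \frac{\kappa s_0}{1 + \kappa} \Big) = \Pr\big( X_{s_0} \ge (1 + \epsilon)\lambda n \big), \quad (10)$$

where the inequality is due to $|\mathcal{O}(s)|$ being non-negative.

Let $y_s \triangleq X_s - X_{s-1}$, where recall that $X_0 = \lambda n$. Consider the following sequence:

$$y_s' \triangleq y_s - E(y_s \mid y_1, \dots, y_{s-1}).$$

Observe that $\{y_s'\}$ is a martingale difference process w.r.t. $\{\mathcal{F}_s\}$. Since $\{X_s\}$ is a supermartingale,
$E(y_s \mid y_1, \dots, y_{s-1}) \le 0$ a.s. Therefore, the following holds a.s.,

$$X_s - X_0 = \sum_{i=1}^{s} y_i = \sum_{i=1}^{s} y_i' + \sum_{i=1}^{s} E(y_i \mid y_1, \dots, y_{i-1}) \le \sum_{i=1}^{s} y_i'.$$

By definition, $|y_s| \le 1$, and hence $|y_s'| \le 2$. Now apply Azuma's inequality:

$$\Pr\big( X_{s_0} \ge (1 + \epsilon)\lambda n \big) \le \Pr\Big( \sum_{i=1}^{s_0} y_i' \ge \epsilon \lambda n \Big) \le \exp\big( -(\epsilon \lambda n)^2 / (8 s_0) \big)$$

$$= \exp\Big( -\frac{\epsilon^2 (\lambda n)^2 \kappa}{8(1 + \epsilon)(1 + \kappa)\lambda n} \Big) = \exp\Big( -\frac{\epsilon^2 \lambda n \kappa}{8(1 + \epsilon)(1 + \kappa)} \Big) \le \max\Big[ \exp\Big( -\frac{\epsilon^2 \lambda n \kappa}{16(1 + \kappa)} \Big),\ \exp\Big( -\frac{\epsilon \lambda n \kappa}{16(1 + \kappa)} \Big) \Big].$$

We claim that the right-hand side is upper bounded by $\gamma$. This is because:

$$\epsilon \ge \sqrt{\frac{16(1 + \kappa) \log(1/\gamma)}{\kappa \lambda n}} \ \Longrightarrow\ \exp\Big( -\frac{\epsilon^2 \lambda n \kappa}{16(1 + \kappa)} \Big) \le \gamma;$$

and

$$\epsilon \ge \frac{16(1 + \kappa) \log(1/\gamma)}{\kappa \lambda n} \ \Longrightarrow\ \exp\Big( -\frac{\epsilon \lambda n \kappa}{16(1 + \kappa)} \Big) \le \gamma.$$

Substituting into (10), the theorem follows.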
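The final step can be checked numerically: with $\epsilon$ and $s_0$ chosen as in Theorem 9, the Azuma tail $\exp(-(\epsilon\lambda n)^2/(8 s_0))$ should not exceed $\gamma$. The sketch below evaluates this for a few parameter settings; the settings and the function name are illustrative assumptions.

```python
import math

def azuma_tail(n, lam, kappa, gamma):
    """exp(-(eps*lam*n)^2 / (8*s0)) with eps and s0 taken from Theorem 9."""
    a = (1 + kappa) * math.log(1 / gamma) / (kappa * lam * n)
    eps = 16 * a + 4 * math.sqrt(a)
    s0 = (1 + eps) * (1 + kappa) * lam * n / kappa
    return math.exp(-((eps * lam * n) ** 2) / (8 * s0))

cases = [
    (100, 0.1, 1.0, 0.01),
    (10**4, 0.3, 0.5, 0.05),
    (10**6, 0.05, 2.0, 1e-3),
]
for n, lam, kappa, gamma in cases:
    print(f"n = {n}: tail = {azuma_tail(n, lam, kappa, gamma):.3e}, gamma = {gamma}")
```

The inequality holds for every setting because $\epsilon \ge 16a$ and $\epsilon \ge 4\sqrt{a}$ together give $\epsilon^2 \ge 8a(1 + \epsilon)$, where $a = (1+\kappa)\log(1/\gamma)/(\kappa\lambda n)$.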



E. Proof of Lemmas 5 and 6 and Theorems 11 and 1

We now prove all the intermediate results used in Section V-F.

Lemma m If £{s) is true for some s < sq, and there exists ei,e2,c such that 





sup 
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sup 


1 T |2 
7 -^1 
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(III) 


sup 

we5m 
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,t - So. 



1 < 62 



<ei 



then 



1 + K 



(1 - ei)V ( ) - 2 V (1 + e2)cdi/ 



(11) 



< (1 + e2)iJs + 2v/(l + e2)cdiJ, 



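Proof: If $\mathcal{E}(s)$ is true, then we have

$$\sum_{j=1}^{d} \sum_{\mathbf{z}_i \in \mathcal{Z}(s-1)} (\mathbf{w}_j(s)^\top \mathbf{z}_i)^2 \ge \frac{1}{\kappa} \sum_{j=1}^{d} \sum_{\mathbf{o}_i \in \mathcal{O}(s-1)} (\mathbf{w}_j(s)^\top \mathbf{o}_i)^2.$$

Recall that $\mathcal{Y}(s-1) = \mathcal{Z}(s-1) \cup \mathcal{O}(s-1)$, and that $\mathcal{Z}(s-1)$ and $\mathcal{O}(s-1)$ are disjoint. We thus have

$$\frac{1}{1+\kappa} \sum_{j=1}^{d} \sum_{\mathbf{y}_i \in \mathcal{Y}(s-1)} (\mathbf{w}_j(s)^\top \mathbf{y}_i)^2 \le \sum_{j=1}^{d} \sum_{\mathbf{z}_i \in \mathcal{Z}(s-1)} (\mathbf{w}_j(s)^\top \mathbf{z}_i)^2. \qquad (12)$$

Since $\mathbf{w}_1(s), \cdots, \mathbf{w}_d(s)$ are the solution of the $s$-th stage, the following holds by definition of the algorithm:

$$\sum_{j=1}^{d} \sum_{\mathbf{y}_i \in \mathcal{Y}(s-1)} (\overline{\mathbf{w}}_j^\top \mathbf{y}_i)^2 \le \sum_{j=1}^{d} \sum_{\mathbf{y}_i \in \mathcal{Y}(s-1)} (\mathbf{w}_j(s)^\top \mathbf{y}_i)^2. \qquad (13)$$

Further note that by $\mathcal{Z}(s-1) \subseteq \mathcal{Y}(s-1)$ and $\mathcal{Z}(s-1) \subseteq \mathcal{Z}$, we have

$$\sum_{j=1}^{d} \sum_{\mathbf{z}_i \in \mathcal{Z}(s-1)} (\overline{\mathbf{w}}_j^\top \mathbf{z}_i)^2 \le \sum_{j=1}^{d} \sum_{\mathbf{y}_i \in \mathcal{Y}(s-1)} (\overline{\mathbf{w}}_j^\top \mathbf{y}_i)^2,$$

and

$$\sum_{j=1}^{d} \sum_{\mathbf{z}_i \in \mathcal{Z}(s-1)} (\mathbf{w}_j(s)^\top \mathbf{z}_i)^2 \le \sum_{j=1}^{d} \sum_{\mathbf{z}_i \in \mathcal{Z}} (\mathbf{w}_j(s)^\top \mathbf{z}_i)^2 = \sum_{j=1}^{d} \sum_{i=1}^{t} (\mathbf{w}_j(s)^\top \mathbf{z}_i)^2.$$

Substituting them into (12) and (13) we have

$$\frac{1}{1+\kappa} \sum_{j=1}^{d} \sum_{\mathbf{z}_i \in \mathcal{Z}(s-1)} (\overline{\mathbf{w}}_j^\top \mathbf{z}_i)^2 \le \sum_{j=1}^{d} \sum_{i=1}^{t} (\mathbf{w}_j(s)^\top \mathbf{z}_i)^2.$$

Note that $|\mathcal{Z}(s-1)| \ge t - (s-1) \ge t - s_0$, hence for all $j = 1, \cdots, d$,

$$\sum_{i=1}^{t-s_0} |\overline{\mathbf{w}}_j^\top \mathbf{z}|^2_{(i)} \le \sum_{i=1}^{|\mathcal{Z}(s-1)|} |\overline{\mathbf{w}}_j^\top \mathbf{z}|^2_{(i)} \le \sum_{\mathbf{z}_i \in \mathcal{Z}(s-1)} (\overline{\mathbf{w}}_j^\top \mathbf{z}_i)^2,$$

which in turn implies

$$\frac{1}{1+\kappa} \sum_{j=1}^{d} \sum_{i=1}^{t-s_0} |\overline{\mathbf{w}}_j^\top \mathbf{z}|^2_{(i)} \le \sum_{j=1}^{d} \sum_{i=1}^{t} (\mathbf{w}_j(s)^\top \mathbf{z}_i)^2.$$

By Corollary 2 and Corollary 3 we conclude

$$\frac{1}{1+\kappa}\Big[(1-\epsilon_1)\mathcal{V}\Big(\frac{t-s_0}{t}\Big)\overline{H} - 2\sqrt{(1+\epsilon_2) c d \overline{H}}\Big] \le (1+\epsilon_2) H_s + 2\sqrt{(1+\epsilon_2) c d H_s} + c.$$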



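Lemma 6: Fix a $\hat{t} \le t$. If $\sum_{j=1}^{d} \overline{V}_{\hat{t}}(\mathbf{w}_j) \ge \sum_{j=1}^{d} \overline{V}_{\hat{t}}(\mathbf{w}'_j)$, and there exist $\epsilon_1, \epsilon_2, c$ such that

(I) $\sup_{\mathbf{w} \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{\hat{t} - \frac{\lambda t}{1-\lambda}} |\mathbf{w}^\top \mathbf{x}|^2_{(i)} - \mathcal{V}\Big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\Big) \Big| \le \epsilon_1$;

(II) $\sup_{\mathbf{w} \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{\hat{t}} |\mathbf{w}^\top \mathbf{x}|^2_{(i)} - \mathcal{V}\Big(\frac{\hat{t}}{t}\Big) \Big| \le \epsilon_1$;

(III) $\sup_{\mathbf{w} \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{t} |\mathbf{w}^\top \mathbf{x}_i|^2 - 1 \Big| \le \epsilon_2$;

(IV) $\sup_{\mathbf{w} \in \mathcal{S}_m} \frac{1}{t} \sum_{i=1}^{t} |\mathbf{w}^\top \mathbf{n}_i|^2 \le c$,

then

$$(1-\epsilon_1)\mathcal{V}\Big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\Big) H(\mathbf{w}'_1, \cdots, \mathbf{w}'_d) - 2\sqrt{(1+\epsilon_2) c d H(\mathbf{w}'_1, \cdots, \mathbf{w}'_d)}$$
$$\le (1+\epsilon_1) H(\mathbf{w}_1, \cdots, \mathbf{w}_d) \mathcal{V}\Big(\frac{\hat{t}}{t}\Big) + 2\sqrt{(1+\epsilon_2) c d H(\mathbf{w}_1, \cdots, \mathbf{w}_d)} + c.$$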
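Proof: Recall that $\overline{V}_{\hat{t}}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{\hat{t}} |\mathbf{w}^\top \mathbf{y}|^2_{(i)}$. Since $\mathcal{Z} \subseteq \mathcal{Y}$ and $|\mathcal{Y} \setminus \mathcal{Z}| = \lambda n = \lambda t/(1-\lambda)$, we have

$$\sum_{i=1}^{\hat{t} - \frac{\lambda t}{1-\lambda}} |\mathbf{w}^\top \mathbf{z}|^2_{(i)} \le \sum_{i=1}^{\hat{t}} |\mathbf{w}^\top \mathbf{y}|^2_{(i)} \le \sum_{i=1}^{\hat{t}} |\mathbf{w}^\top \mathbf{z}|^2_{(i)}.$$

By the assumption $\sum_{j=1}^{d} \overline{V}_{\hat{t}}(\mathbf{w}_j) \ge \sum_{j=1}^{d} \overline{V}_{\hat{t}}(\mathbf{w}'_j)$, we have

$$\sum_{j=1}^{d} \sum_{i=1}^{\hat{t} - \frac{\lambda t}{1-\lambda}} |\mathbf{w}'^\top_j \mathbf{z}|^2_{(i)} \le \sum_{j=1}^{d} \sum_{i=1}^{\hat{t}} |\mathbf{w}_j^\top \mathbf{z}|^2_{(i)}.$$

By Corollary 2 and Corollary 3 we conclude

$$(1-\epsilon_1)\mathcal{V}\Big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\Big) H(\mathbf{w}'_1, \cdots, \mathbf{w}'_d) - 2\sqrt{(1+\epsilon_2) c d H(\mathbf{w}'_1, \cdots, \mathbf{w}'_d)}$$
$$\le (1+\epsilon_1) H(\mathbf{w}_1, \cdots, \mathbf{w}_d) \mathcal{V}\Big(\frac{\hat{t}}{t}\Big) + 2\sqrt{(1+\epsilon_2) c d H(\mathbf{w}_1, \cdots, \mathbf{w}_d)} + c.$$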
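The order-statistics sandwich used at the start of this proof can be checked on synthetic data. The sketch below is illustrative only: it works directly with non-negative squared projections, and the names `smallest_sum` and `sandwich` as well as the random test data are assumptions made for the example.

```python
import random


def smallest_sum(values, k):
    """Sum of the k smallest entries of `values` (0 if k <= 0)."""
    return sum(sorted(values)[:max(k, 0)])


def sandwich(z_sq, o_sq, t_hat):
    """Return the three sums appearing in the sandwich inequality.

    z_sq: squared projections of authentic points (non-negative)
    o_sq: squared projections of corrupted points (non-negative)
    t_hat: number of smallest terms kept, with t_hat <= len(z_sq)
    """
    y_sq = z_sq + o_sq                      # all observed points
    lower = smallest_sum(z_sq, t_hat - len(o_sq))
    middle = smallest_sum(y_sq, t_hat)
    upper = smallest_sum(z_sq, t_hat)
    return lower, middle, upper


if __name__ == "__main__":
    rng = random.Random(1)
    z = [rng.gauss(0, 1) ** 2 for _ in range(90)]
    o = [rng.uniform(0, 50) for _ in range(10)]   # arbitrary corruption
    print(sandwich(z, o, 70))
```

Whatever values the corrupted points take, the middle sum is trapped between the two sums computed from authentic points alone, which is why the outliers can shift the trimmed variance only by a controlled amount.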
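Theorem 7: If $\bigcup_{s=1}^{s_0} \mathcal{E}(s)$ is true, and there exist $\epsilon_1 < 1, \epsilon_2, c$ such that

(I) $\sup_{\mathbf{w} \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{t-s_0} |\mathbf{w}^\top \mathbf{x}|^2_{(i)} - \mathcal{V}\Big(\frac{t-s_0}{t}\Big) \Big| \le \epsilon_1$;

(II) $\sup_{\mathbf{w} \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{\hat{t} - \frac{\lambda t}{1-\lambda}} |\mathbf{w}^\top \mathbf{x}|^2_{(i)} - \mathcal{V}\Big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\Big) \Big| \le \epsilon_1$;

(III) $\sup_{\mathbf{w} \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{\hat{t}} |\mathbf{w}^\top \mathbf{x}|^2_{(i)} - \mathcal{V}\Big(\frac{\hat{t}}{t}\Big) \Big| \le \epsilon_1$;

(IV) $\sup_{\mathbf{w} \in \mathcal{S}_d} \Big| \frac{1}{t} \sum_{i=1}^{t} |\mathbf{w}^\top \mathbf{x}_i|^2 - 1 \Big| \le \epsilon_2$;

(V) $\sup_{\mathbf{w} \in \mathcal{S}_m} \frac{1}{t} \sum_{i=1}^{t} |\mathbf{w}^\top \mathbf{n}_i|^2 \le c$,

then

$$\begin{aligned}
\frac{H^*}{\overline{H}} \ge{}& \frac{(1-\epsilon_1)^2 \mathcal{V}\big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\big) \mathcal{V}\big(\frac{t-s_0}{t}\big)}{(1+\epsilon_1)(1+\epsilon_2)(1+\kappa)\mathcal{V}\big(\frac{\hat{t}}{t}\big)} \\
&- \left[\frac{(2\kappa+4)(1-\epsilon_1)\mathcal{V}\big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\big)\sqrt{(1+\epsilon_2) c d} + 4(1+\kappa)(1+\epsilon_2)\sqrt{(1+\epsilon_2) c d}}{(1+\epsilon_1)(1+\epsilon_2)(1+\kappa)\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1/2} \\
&- \left[\frac{(1-\epsilon_1)\mathcal{V}\big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\big) c + (1+\epsilon_2) c}{(1+\epsilon_1)(1+\epsilon_2)\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1}.
\end{aligned} \qquad (14)$$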
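Proof: Since $\bigcup_{s=1}^{s_0} \mathcal{E}(s)$ is true, there exists an $s' \le s_0$ such that $\mathcal{E}(s')$ is true. By Lemma 5 we have

$$\frac{1}{1+\kappa}\Big[(1-\epsilon_1)\mathcal{V}\Big(\frac{t-s_0}{t}\Big)\overline{H} - 2\sqrt{(1+\epsilon_2) c d \overline{H}}\Big] \le (1+\epsilon_2) H_{s'} + 2\sqrt{(1+\epsilon_2) c d H_{s'}} + c.$$

By the definition of the algorithm, we have $\sum_{j=1}^{d} \overline{V}_{\hat{t}}(\mathbf{w}^*_j) \ge \sum_{j=1}^{d} \overline{V}_{\hat{t}}(\mathbf{w}_j(s'))$, which by Lemma 6 implies

$$(1-\epsilon_1)\mathcal{V}\Big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\Big) H_{s'} - 2\sqrt{(1+\epsilon_2) c d H_{s'}} \le (1+\epsilon_1) H^* \mathcal{V}\Big(\frac{\hat{t}}{t}\Big) + 2\sqrt{(1+\epsilon_2) c d H^*} + c.$$

By definition, $H_{s'}, H^* \le \overline{H}$. Thus we have

(I) $\frac{1}{1+\kappa}\Big[(1-\epsilon_1)\mathcal{V}\Big(\frac{t-s_0}{t}\Big)\overline{H} - 2\sqrt{(1+\epsilon_2) c d \overline{H}}\Big] \le (1+\epsilon_2) H_{s'} + 2\sqrt{(1+\epsilon_2) c d \overline{H}} + c$;

(II) $(1-\epsilon_1)\mathcal{V}\Big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\Big) H_{s'} - 2\sqrt{(1+\epsilon_2) c d \overline{H}} \le (1+\epsilon_1)\mathcal{V}\Big(\frac{\hat{t}}{t}\Big) H^* + 2\sqrt{(1+\epsilon_2) c d \overline{H}} + c$.

Rearranging the inequalities, we have

(I) $(1-\epsilon_1)\mathcal{V}\Big(\frac{t-s_0}{t}\Big)\overline{H} - (2\kappa+4)\sqrt{(1+\epsilon_2) c d \overline{H}} - (1+\kappa) c \le (1+\kappa)(1+\epsilon_2) H_{s'}$;

(II) $(1-\epsilon_1)\mathcal{V}\Big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\Big) H_{s'} \le (1+\epsilon_1)\mathcal{V}\Big(\frac{\hat{t}}{t}\Big) H^* + 4\sqrt{(1+\epsilon_2) c d \overline{H}} + c$.

Simplifying the inequalities, we get

$$\begin{aligned}
\frac{H^*}{\overline{H}} \ge{}& \frac{(1-\epsilon_1)^2 \mathcal{V}\big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\big) \mathcal{V}\big(\frac{t-s_0}{t}\big)}{(1+\epsilon_1)(1+\epsilon_2)(1+\kappa)\mathcal{V}\big(\frac{\hat{t}}{t}\big)} \\
&- \left[\frac{(2\kappa+4)(1-\epsilon_1)\mathcal{V}\big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\big)\sqrt{(1+\epsilon_2) c d} + 4(1+\kappa)(1+\epsilon_2)\sqrt{(1+\epsilon_2) c d}}{(1+\epsilon_1)(1+\epsilon_2)(1+\kappa)\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1/2} \\
&- \left[\frac{(1-\epsilon_1)\mathcal{V}\big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\big) c + (1+\epsilon_2) c}{(1+\epsilon_1)(1+\epsilon_2)\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1}.
\end{aligned}$$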
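The final simplification is pure algebra, so it can be verified numerically. The sketch below treats the two rearranged inequalities (I) and (II) as equalities, chains them, and compares the result with the closed form of the bound. The function names and the shorthand arguments `V1`, `V2`, `V3` (standing for $\mathcal{V}(\frac{t-s_0}{t})$, $\mathcal{V}(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda})$ and $\mathcal{V}(\frac{\hat{t}}{t})$) are notation introduced only for this example.

```python
import math


def closed_form_bound(e1, e2, kappa, c, d, H_bar, V1, V2, V3):
    """Right-hand side of the final bound on H*/H_bar."""
    root = math.sqrt((1 + e2) * c * d)
    denom = (1 + e1) * (1 + e2) * (1 + kappa) * V3
    lead = (1 - e1) ** 2 * V2 * V1 / denom
    half = ((2 * kappa + 4) * (1 - e1) * V2 * root
            + 4 * (1 + kappa) * (1 + e2) * root) / denom
    inv = ((1 - e1) * V2 * c + (1 + e2) * c) / ((1 + e1) * (1 + e2) * V3)
    return lead - half / math.sqrt(H_bar) - inv / H_bar


def chained_bound(e1, e2, kappa, c, d, H_bar, V1, V2, V3):
    """Chain the rearranged inequalities (I) and (II), taken at equality."""
    root = math.sqrt((1 + e2) * c * d * H_bar)
    # (I): lower bound on H_{s'}
    H_s = ((1 - e1) * V1 * H_bar - (2 * kappa + 4) * root
           - (1 + kappa) * c) / ((1 + kappa) * (1 + e2))
    # (II): lower bound on H* given H_{s'}
    H_star = ((1 - e1) * V2 * H_s - 4 * root - c) / ((1 + e1) * V3)
    return H_star / H_bar
```

The two functions agree up to floating-point error for any admissible inputs, which confirms that the displayed bound is exactly what (I) and (II) give when combined.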



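Theorem 8: Let $\tau = \max(m/n, 1)$. There exist a universal constant $c_0$ and a constant $C$ which can possibly depend on $\hat{t}/t$, $\lambda$, $d$, $\mu$ and $\kappa$, such that for any $\gamma < 1$, if $n/\log^4 n \ge \log^6(1/\gamma)$, then with probability $1 - \gamma$ the following holds:

$$\frac{H^*}{\overline{H}} \ge \frac{\mathcal{V}\big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\big) \mathcal{V}\big(1 - \frac{\lambda(1+\kappa)}{(1-\lambda)\kappa}\big)}{(1+\kappa)\mathcal{V}\big(\frac{\hat{t}}{t}\big)} - \left[\frac{8\sqrt{c_0 \tau d}}{\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1/2} - \left[\frac{2 c_0 \tau}{\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1} - C \frac{\log^2 n \log^3(1/\gamma)}{\sqrt{n}}.$$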



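Proof: We need to bound all diminishing terms in the r.h.s. of (14). We need to lower bound $\mathcal{V}((t - s_0)/t)$ using the following lemma.
Lemma 7:

$$\mathcal{V}\Big(\frac{t-s_0}{t}\Big) \ge \mathcal{V}\Big(1 - \frac{\lambda(1+\kappa)}{(1-\lambda)\kappa}\Big) - \epsilon,$$

where $\epsilon \le c \frac{\log(1/\gamma)}{n} + c \sqrt{\frac{\log(1/\gamma)}{n}}$.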
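Proof: Given $a^- < a^+ < 1$, by the definition of $\mathcal{V}$ we have

$$\frac{\mathcal{V}(a^+) - \mathcal{V}(a^-)}{a^+ - a^-} \le \frac{1 - \mathcal{V}(a^+)}{1 - a^+}.$$

Re-arranging, we have

$$\mathcal{V}(a^-) \ge \frac{1 - a^-}{1 - a^+} \mathcal{V}(a^+) - \frac{a^+ - a^-}{1 - a^+} \ge \mathcal{V}(a^+) - \frac{a^+ - a^-}{1 - a^+}.$$

Recall $s_0 = (1+\epsilon)(1+\kappa)\lambda n/\kappa = (1+\epsilon)(1+\kappa)\lambda t/(\kappa(1-\lambda))$. Let $s' = (1+\kappa)\lambda t/(\kappa(1-\lambda))$. Taking $a^+ = (t - s')/t$ and $a^- = (t - s_0)/t$, so that $(a^+ - a^-)/(1 - a^+) = \epsilon$, the lemma follows. ■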
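To see Lemma 7 at work on a concrete distribution, the sketch below assumes the one-dimensional marginal is standard Gaussian, in which case $\mathcal{V}(\alpha) = \alpha - 2 c_\alpha \varphi(c_\alpha)$ with $c_\alpha$ the $(1+\alpha)/2$ quantile and $\varphi$ the standard normal density. The Gaussian choice, the function names, and the sample parameters are assumptions made for illustration; the lemma itself does not depend on them.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def gaussian_V(alpha):
    """V(alpha) for a standard Gaussian marginal (illustrative assumption).

    V(alpha) is the integral of x^2 over the central interval of mass alpha.
    """
    if alpha <= 0.0:
        return 0.0
    if alpha >= 1.0:
        return 1.0
    c = _STD_NORMAL.inv_cdf((1.0 + alpha) / 2.0)
    return alpha - 2.0 * c * _STD_NORMAL.pdf(c)


def lemma7_slack(lam, kappa, eps):
    """V((t - s0)/t) - [V((t - s')/t) - eps]; non-negative if the lemma holds."""
    s_prime_ratio = (1.0 + kappa) * lam / (kappa * (1.0 - lam))   # s'/t
    a_plus = 1.0 - s_prime_ratio
    a_minus = 1.0 - (1.0 + eps) * s_prime_ratio
    return gaussian_V(a_minus) - (gaussian_V(a_plus) - eps)


if __name__ == "__main__":
    for eps in (0.01, 0.1, 0.5):
        print(eps, lemma7_slack(lam=0.05, kappa=1.0, eps=eps))
```

The slack stays non-negative over the tested range, consistent with the convexity argument in the proof.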
We also need the following two lemmas. The proofs are straightforward. 
Lemma 8: For any < ai, ^2 < 1 and c > 0, we have 

1 - a < 1/(1 + a); (1 - ai)(l - 0:2) < 1 - (ai + 02); a/c + ai < a/c + ai. 
Lemma 9: If $n/\log^4 n \ge \log^6(1/\gamma)$, then



maxli^^^ / lQg(V7) log'-'nlog^-^(l/7) \ ^ log^ n log^(l/7) 



n \ n \ n n j \/n 

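Recall that with probability $1 - \gamma$, $c \le c_0 \tau + c \frac{\log(1/\gamma)}{n}$, where $c_0$ is a universal constant, and the constant $c$ depends on $\kappa$, $\hat{t}/t$, $\lambda$, and $\mu$. We denote $c - c_0 \tau$ by $\epsilon_c$. Iteratively applying Lemma 8 and Lemma 7, the following holds when $\epsilon_1, \epsilon_2, \epsilon_c < 1$:

$$\begin{aligned}
\frac{H^*}{\overline{H}} \ge{}& \frac{(1-\epsilon_1)^2 \mathcal{V}\big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\big) \mathcal{V}\big(\frac{t-s_0}{t}\big)}{(1+\epsilon_1)(1+\epsilon_2)(1+\kappa)\mathcal{V}\big(\frac{\hat{t}}{t}\big)} \\
&- \left[\frac{(2\kappa+4)(1-\epsilon_1)\mathcal{V}\big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\big)\sqrt{(1+\epsilon_2) c d} + 4(1+\kappa)(1+\epsilon_2)\sqrt{(1+\epsilon_2) c d}}{(1+\epsilon_1)(1+\epsilon_2)(1+\kappa)\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1/2} \\
&- \left[\frac{(1-\epsilon_1)\mathcal{V}\big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\big) c + (1+\epsilon_2) c}{(1+\epsilon_1)(1+\epsilon_2)\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1} \\
\ge{}& \frac{(1-\epsilon_1)^3 (1-\epsilon_2) \mathcal{V}\big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\big) \mathcal{V}\big(1 - \frac{\lambda(1+\kappa)}{(1-\lambda)\kappa}\big)}{(1+\kappa)\mathcal{V}\big(\frac{\hat{t}}{t}\big)} - \left[\frac{4\sqrt{c d} + 4\sqrt{(1+\epsilon_2) c d}}{\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1/2} - \left[\frac{2c}{\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1} - \epsilon \\
\ge{}& \frac{(1 - 15\max(\epsilon_1, \epsilon_2)) \mathcal{V}\big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\big) \mathcal{V}\big(1 - \frac{\lambda(1+\kappa)}{(1-\lambda)\kappa}\big)}{(1+\kappa)\mathcal{V}\big(\frac{\hat{t}}{t}\big)} - \left[\frac{4\sqrt{(c_0\tau + \epsilon_c) d} + 4(1+\epsilon_2)\sqrt{(c_0\tau + \epsilon_c) d}}{\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1/2} - \left[\frac{2(c_0\tau + \epsilon_c)}{\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1} - \epsilon \\
\ge{}& \frac{(1 - 15\max(\epsilon_1, \epsilon_2)) \mathcal{V}\big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\big) \mathcal{V}\big(1 - \frac{\lambda(1+\kappa)}{(1-\lambda)\kappa}\big)}{(1+\kappa)\mathcal{V}\big(\frac{\hat{t}}{t}\big)} - \left[\frac{4(\sqrt{c_0\tau} + \epsilon_c)\sqrt{d} + 4(1+\epsilon_2)(\sqrt{c_0\tau} + \epsilon_c)\sqrt{d}}{\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1/2} - \left[\frac{2(c_0\tau + \epsilon_c)}{\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1} - \epsilon.
\end{aligned}$$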





Recall that with probability 1 - 7, €2 < c'^^l^^^ilh ^ ^ < ci2i^ + c^J'^^^, c < cqt + c 



log(l/7) 



Furthermore, ei < cy^iHszi±MiM + ^ log -niog ■ = ^ < 1, and ei < max(cy^^^^^^^±^^M + 

c— — ^^^^^ — (IM ^ if t = t. Here, cq is a universal constant, and the constant c depends on k, 77, A, and 
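$\mu$. Further note that by Lemma 9 we can bound all diminishing terms by $\frac{\log^2 n \log^3(1/\gamma)}{\sqrt{n}}$. Therefore, when $\epsilon_1, \epsilon_2, \epsilon_c < 1$, we have, for some constant $C_1$,

$$\frac{H^*}{\overline{H}} \ge \frac{\mathcal{V}\big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\big) \mathcal{V}\big(1 - \frac{\lambda(1+\kappa)}{(1-\lambda)\kappa}\big)}{(1+\kappa)\mathcal{V}\big(\frac{\hat{t}}{t}\big)} - \left[\frac{8\sqrt{c_0 \tau d}}{\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1/2} - \left[\frac{2 c_0 \tau}{\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1} - C_1 \frac{\log^2 n \log^3(1/\gamma)}{\sqrt{n}}.$$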
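On the other hand, when $\max(\epsilon_1, \epsilon_2, \epsilon_c) \ge 1$, by Lemma 9 we have $\max(\epsilon_1, \epsilon_2, \epsilon_c) \le C_2 \frac{\log^2 n \log^3(1/\gamma)}{\sqrt{n}}$ for some constant $C_2$. Thus, $C_2 \frac{\log^2 n \log^3(1/\gamma)}{\sqrt{n}} \ge 1$. Therefore, when $\max(\epsilon_1, \epsilon_2, \epsilon_c) \ge 1$,

$$\frac{H^*}{\overline{H}} \ge 0 \ge \frac{\mathcal{V}\big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\big) \mathcal{V}\big(1 - \frac{\lambda(1+\kappa)}{(1-\lambda)\kappa}\big)}{(1+\kappa)\mathcal{V}\big(\frac{\hat{t}}{t}\big)} - \left[\frac{8\sqrt{c_0 \tau d}}{\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1/2} - \left[\frac{2 c_0 \tau}{\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1} - C_2 \frac{\log^2 n \log^3(1/\gamma)}{\sqrt{n}}.$$

Letting $C = \max(C_1, C_2)$, we have proved that

$$\frac{H^*}{\overline{H}} \ge \frac{\mathcal{V}\big(\frac{\hat{t}}{t} - \frac{\lambda}{1-\lambda}\big) \mathcal{V}\big(1 - \frac{\lambda(1+\kappa)}{(1-\lambda)\kappa}\big)}{(1+\kappa)\mathcal{V}\big(\frac{\hat{t}}{t}\big)} - \left[\frac{8\sqrt{c_0 \tau d}}{\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1/2} - \left[\frac{2 c_0 \tau}{\mathcal{V}\big(\frac{\hat{t}}{t}\big)}\right] (\overline{H})^{-1} - C \frac{\log^2 n \log^3(1/\gamma)}{\sqrt{n}}.$$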